Bug #64667 mb_detect_encoding problem
Submitted: 2013-04-18 14:08 UTC Modified: 2015-07-16 17:55 UTC
Votes:6
Avg. Score:4.8 ± 0.4
Reproduced:6 of 6 (100.0%)
Same Version:2 (33.3%)
Same OS:1 (16.7%)
From: jisgro at teliae dot fr Assigned: cmb
Status: Not a bug Package: mbstring related
PHP Version: 5.3.24 OS: Debian Edge
Private report: No CVE-ID: None

 [2013-04-18 14:08 UTC] jisgro at teliae dot fr
Description:
------------
PHP 5.3.3
We open a file with ANSI encoding and convert it to UTF-8, after calling iconv_set_encoding("internal_encoding", "UTF-8");.
mb_detect_encoding returns the same value before and after the conversion: Format : ISO-8859-1

The function is in the test script below. It outputs:

Format : ISO-8859-1 mystere
ééé ééé é éé à à à à à , <-> , ��� ��� � �� � � � � � ,
Format : ISO-8859-1 

Test script:
---------------
function convertirFichierEnUTF8($sNomFichier){
  $sContenuFichier = file_get_contents($sNomFichier);
  if($sContenuFichier == ''){//empty file or read error
    return;
  }

  $tabFormatsReconnus = array(
     'ASCII'
    ,'ISO-8859-1'
    ,'ISO-8859-2'
    ,'ISO-8859-15'
    ,'UTF-8'
    ,'UTF-16'
    ,'UTF-32'
    ,'Windows-1251'
    ,'Windows-1252'
  );
  $sFormat = mb_detect_encoding($sContenuFichier, $tabFormatsReconnus, true);
  //echo $sNomFichier."\n";
  echo "Format : ".$sFormat."\n";

  if($sFormat === false){
    CLog::trace('Erreur encodage du fichier '.$sNomFichier.' inconnu', 'Conversion fichier', 'Erreur détection encodage', 0,
        CLog::INIVEAU_ERREUR_CRITIQUE, CConfig::$sEmail_Trace_Erreur);
    return;
  }

  //The following formats need no conversion
  if(in_array($sFormat, array('UTF-8', 'ASCII'))){
    return;
  }

  iconv_set_encoding("internal_encoding", "UTF-8");
  //iconv_set_encoding("output_encoding", "UTF-8");
  $sNouveauContenu = iconv($sFormat, 'UTF-8', $sContenuFichier);

  //If the conversion ran into a problem
  if($sNouveauContenu === ''){
    CLog::trace('Erreur à la conversion en UTF8 du fichier '.$sNomFichier, 'Conversion fichier', 'Erreur conversion UTF8', 0,
       CLog::INIVEAU_ERREUR);
    $sNouveauContenu = iconv($sFormat, 'UTF-8//IGNORE', $sContenuFichier);
    CreeRepSiNonExiste(CConfig::$sRepertoire_log, 'erreursConversionFichiers');
    file_put_contents(CConfig::$sRepertoire_log.'erreursConversionFichiers/'.basename($sNomFichier), $sContenuFichier);
  }

  //Save the result of the conversion
  file_put_contents($sNomFichier, $sNouveauContenu);
  //Compare original and converted content (the output above shows 'mystere', i.e. they differ)
  echo ($sContenuFichier === $sNouveauContenu ? 'aie aie aie c pareil':'mystere' );
  ttt($sNouveauContenu,'<->',$sContenuFichier);
  $sFormat = mb_detect_encoding($sNouveauContenu, $tabFormatsReconnus, true);
  //echo $sNomFichier."\n";
  echo "Format : ".$sFormat."\n";
}

Expected result:
----------------
The format detected after the conversion is UTF-8.

Actual result:
--------------
Format : ISO-8859-1 

 [2013-04-18 14:53 UTC] jisgro at teliae dot fr
Sorry, the test script is not very simple. Here is a lighter version:

function convertFileInUTF8($sFileName){
  $sFileContent = file_get_contents($sFileName);

  $tabKnownEncoding = array(
      'ASCII'
      ,'ISO-8859-1'
      ,'ISO-8859-2'
      ,'ISO-8859-15'
      ,'UTF-8'
      ,'UTF-16'
      ,'UTF-32'
      ,'Windows-1251'
      ,'Windows-1252'
  );
  $sFormat = mb_detect_encoding($sFileContent, $tabKnownEncoding, true);

  echo "Format : ".$sFormat."\n";

  iconv_set_encoding("internal_encoding", "UTF-8");

  $sNewContent = iconv($sFormat, 'UTF-8', $sFileContent);

  //Save
  file_put_contents($sFileName, $sNewContent);

  ttt($sFileContent,'<->',$sNewContent);
  $sFormat = mb_detect_encoding($sNewContent, $tabKnownEncoding, true);

  echo "Format : ".$sFormat."\n";
}
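
One way to check the outcome of the conversion without a second mb_detect_encoding() call is to validate the converted bytes directly. The following is only a minimal sketch, not the reporter's code: the helper name convertToUtf8() is hypothetical, and it assumes the source encoding is already known.

```php
<?php
// Hypothetical helper (not from the report): convert $bytes from a known
// source encoding to UTF-8 and verify the result instead of re-detecting it.
function convertToUtf8(string $bytes, string $fromEncoding): string
{
    // iconv() returns false on failure, so compare against false, not ''.
    $converted = @iconv($fromEncoding, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Conversion from $fromEncoding failed");
    }
    // mb_check_encoding() validates against exactly one encoding, so it is
    // not affected by the order of a candidate list.
    if (!mb_check_encoding($converted, 'UTF-8')) {
        throw new RuntimeException('Result is not valid UTF-8');
    }
    return $converted;
}

$latin1 = "\xE9t\xE9";                        // "été" encoded as ISO-8859-1
$utf8 = convertToUtf8($latin1, 'ISO-8859-1');
echo bin2hex($utf8), "\n";                    // prints c3a974c3a9
```

The hex dump avoids depending on how the terminal renders non-ASCII bytes.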
 [2013-05-01 01:17 UTC] mail+php at requinix dot net
In general, and in your circumstance, mb_detect_encoding() will return the first 
encoding that matches. The requirements for "ASCII" are that the bytes are all 
<0x80 and your file won't match that. Next is "ISO-8859-1" which doesn't really 
have any requirements at all. And that's the problem: it will always succeed.

The array should be arranged with most exclusive (harder to validate) encodings 
first and most permissive (easier to validate) last. I would start testing with:

[UTF-32, UTF-16, UTF-8, ASCII, ???]

The problem is what to fall back to. ISO 8859-1, -2, and -15 will always 
succeed, and Windows-1251 and -1252 will only succeed if the entire string 
consists of high-byte characters in a certain range. (Why do those two work like 
that? No clue.)

So make a choice: if a string is neither UTF-* nor simple ASCII what do you 
think it probably is? You're writing code in French so I'm going to guess ISO 
8859-15.

[UTF-32, UTF-16, UTF-8, ASCII, ISO-8859-15]

If you want to go beyond that then you can do some rudimentary character 
analysis: some byte combinations may make sense in one encoding but not in 
another. Example: about half of \xA0-\xAF bytes in ISO 8859-1/-15 are symbols 
but are characters in ISO 8859-2.
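
The "most exclusive first, permissive fallback last" idea can be sketched without mb_detect_encoding() at all. This is an illustration under assumptions, not how mbstring works internally: guessEncoding() is a hypothetical name, the ISO-8859-15 default is just the guess suggested above, and UTF-16 and UTF-32 are left out to keep the sketch short.

```php
<?php
// Hypothetical helper: run the strict checks first, then fall back to an
// assumed single-byte encoding. The fallback is a guess, not a detection.
function guessEncoding(string $bytes, string $fallback = 'ISO-8859-15'): string
{
    // ASCII: every byte must be below 0x80.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 0) {
        return 'ASCII';
    }
    // UTF-8: all multi-byte sequences must be well formed.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Anything else is assumed to be the caller's preferred legacy encoding.
    return $fallback;
}

echo guessEncoding("abc"), "\n";          // ASCII
echo guessEncoding("\xC3\xA9"), "\n";     // UTF-8 ("é" as two bytes)
echo guessEncoding("\xE9"), "\n";         // ISO-8859-15 (lone 0xE9 is not valid UTF-8)
```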
 [2014-04-23 13:37 UTC] scy-bugs-php at scy dot name
In case anybody is curious about the comment above that says "Windows-1251 and -1252 will only succeed if the entire string consists of high-byte characters in a certain range": I wanted to find out what these "legal" characters are, and it's 0x80 to 0x9f. Anything outside that (even a plain "A") in your string and it will _not_ be identified as Windows-1252, even though it very much might be. Yes, of course, this is completely useless, as probably every Windows-1252 string will contain valid ASCII as well.

See https://github.com/php/php-src/blob/94e15ff3877f842e5eb5c89e3aeab214fb4a3a33/ext/mbstring/libmbfl/filters/mbfilter_cp1252.c#L138 for the source code that does this.
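
A more useful hand-rolled heuristic for telling Windows-1252 apart from the ISO 8859 family might look like the sketch below. It relies on the fact that bytes 0x80 to 0x9F are C1 control codes in ISO 8859-1 and -15 but mostly printable characters (euro sign, curly quotes and so on) in Windows-1252. The function name looksLikeWindows1252() is hypothetical, and this is a heuristic for illustration, not what mbstring does.

```php
<?php
// Hypothetical heuristic: a string that is not valid UTF-8 and contains
// bytes in 0x80-0x9F is more plausibly Windows-1252 than ISO 8859-x,
// because those bytes are rarely used control codes in ISO 8859-x.
function looksLikeWindows1252(string $bytes): bool
{
    // Valid UTF-8 (which includes plain ASCII) is not treated as Windows-1252.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return false;
    }
    return preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

var_dump(looksLikeWindows1252("Price: 5 \x80"));  // bool(true), 0x80 is the euro sign
var_dump(looksLikeWindows1252("caf\xE9"));        // bool(false), no byte in 0x80-0x9F
var_dump(looksLikeWindows1252("plain text"));     // bool(false), valid UTF-8
```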
 [2015-07-16 17:55 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Package: *General Issues
+Package: mbstring related
-Assigned To:
+Assigned To: cmb
 [2015-07-16 17:55 UTC] cmb@php.net
As requinix has explained, this is not a bug.