Bug #64667 mb_detect_encoding problem
Submitted: 2013-04-18 14:08 UTC
Modified: 2015-07-16 17:55 UTC
Votes:6
Avg. Score:4.8 ± 0.4
Reproduced:6 of 6 (100.0%)
Same Version:2 (33.3%)
Same OS:1 (16.7%)
From: jisgro at teliae dot fr
Assigned: cmb
Status: Not a bug
Package: mbstring related
PHP Version: 5.3.24
OS: Debian Edge
Private report: No
CVE-ID: None
 [2013-04-18 14:08 UTC] jisgro at teliae dot fr
Description:
------------
PHP 5.3.3
We open a file with ANSI encoding and set the encoding to UTF-8 with the iconv_set_encoding("internal_encoding", "UTF-8") function.
mb_detect_encoding() returns the same result before and after the conversion: Format : ISO-8859-1

The function in the test script below prints:

Format : ISO-8859-1 mystere
ééé ééé é éé à à à à à , <-> , ��� ��� � �� � � � � � ,
Format : ISO-8859-1 

Test script:
---------------
function convertirFichierEnUTF8($sNomFichier){
  $sContenuFichier = file_get_contents($sNomFichier);
  if($sContenuFichier == ''){ // empty file or read error (false == '' is true)
    return;
  }

  $tabFormatsReconnus = array(
     'ASCII'
    ,'ISO-8859-1'
    ,'ISO-8859-2'
    ,'ISO-8859-15'
    ,'UTF-8'
    ,'UTF-16'
    ,'UTF-32'
    ,'Windows-1251'
    ,'Windows-1252'
  );
  $sFormat = mb_detect_encoding($sContenuFichier, $tabFormatsReconnus, true);
  //echo $sNomFichier."\n";
  echo "Format : ".$sFormat."\n";

  if($sFormat === false){
    CLog::trace('Erreur encodage du fichier '.$sNomFichier.' inconnu', 'Conversion fichier', 'Erreur détection encodage', 0,
        CLog::INIVEAU_ERREUR_CRITIQUE, CConfig::$sEmail_Trace_Erreur);
    return;
  }

  // The following formats do not need conversion
  if(in_array($sFormat, array('UTF-8', 'ASCII'))){
    return;
  }

  iconv_set_encoding("internal_encoding", "UTF-8");
  //iconv_set_encoding("output_encoding", "UTF-8");
  $sNouveauContenu = iconv($sFormat, 'UTF-8', $sContenuFichier);

  // If the conversion failed (iconv() returns false on failure)
  if($sNouveauContenu === false || $sNouveauContenu === ''){
    CLog::trace('Erreur à la conversion en UTF8 du fichier '.$sNomFichier, 'Conversion fichier', 'Erreur conversion UTF8', 0,
       CLog::INIVEAU_ERREUR);
    $sNouveauContenu = iconv($sFormat, 'UTF-8//IGNORE', $sContenuFichier);
    CreeRepSiNonExiste(CConfig::$sRepertoire_log, 'erreursConversionFichiers');
    file_put_contents(CConfig::$sRepertoire_log.'erreursConversionFichiers/'.basename($sNomFichier), $sContenuFichier);
  }

  // Save the result of the conversion
  file_put_contents($sNomFichier, $sNouveauContenu);
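  // Compare original and converted content ('aie aie aie c pareil' = identical, 'mystere' = different)
  echo ($sContenuFichier === $sNouveauContenu ? 'aie aie aie c pareil':'mystere' );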
  ttt($sNouveauContenu,'<->',$sNouveauContenu);
  $sFormat = mb_detect_encoding($sNouveauContenu, $tabFormatsReconnus, true);
  //echo $sNomFichier."\n";
  echo "Format : ".$sFormat."\n";
}

Expected result:
----------------
return format in UTF8

Actual result:
--------------
Format : ISO-8859-1 

 [2013-04-18 14:53 UTC] jisgro at teliae dot fr
Sorry, the test script is not very simple; here is a "light version":

function convertFileInUTF8($sFileName){
  $sFileContent = file_get_contents($sFileName);

  $tabKnownEncoding = array(
      'ASCII'
      ,'ISO-8859-1'
      ,'ISO-8859-2'
      ,'ISO-8859-15'
      ,'UTF-8'
      ,'UTF-16'
      ,'UTF-32'
      ,'Windows-1251'
      ,'Windows-1252'
  );
  $sFormat = mb_detect_encoding($sFileContent, $tabKnownEncoding, true);

  echo "Format : ".$sFormat."\n";

  iconv_set_encoding("internal_encoding", "UTF-8");

  $sNewContent = iconv($sFormat, 'UTF-8', $sFileContent);

  //Save
  file_put_contents($sFileName, $sNewContent);

  ttt($sFileContent,'<->',$sNewContent);
  $sFormat = mb_detect_encoding($sNewContent, $tabKnownEncoding, true);

  echo "Format : ".$sFormat."\n";
}
 [2013-05-01 01:17 UTC] mail+php at requinix dot net
In general, and in your circumstance, mb_detect_encoding() will return the first 
encoding that matches. The requirements for "ASCII" are that the bytes are all 
<0x80 and your file won't match that. Next is "ISO-8859-1" which doesn't really 
have any requirements at all. And that's the problem: it will always succeed.

The array should be arranged with most exclusive (harder to validate) encodings 
first and most permissive (easier to validate) last. I would start testing with:

[UTF-32, UTF-16, UTF-8, ASCII, ???]

The problem is what to fall back to. ISO 8859-1, -2, and -15 will always 
succeed, and Windows-1251 and -1252 will only succeed if the entire string 
consists of high-byte characters in a certain range. (Why do those two work like 
that? No clue.)

So make a choice: if a string is neither UTF-* nor simple ASCII what do you 
think it probably is? You're writing code in French so I'm going to guess ISO 
8859-15.

[UTF-32, UTF-16, UTF-8, ASCII, ISO-8859-15]

If you want to go beyond that then you can do some rudimentary character 
analysis: some byte combinations may make sense in one encoding but not in 
another. Example: about half of the \xA0-\xAF bytes are symbols in ISO 8859-1/-15 
but letters in ISO 8859-2.
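
The ordering advice above can be sketched without relying on the detector's first-match behaviour at all: test the strict encodings explicitly and only then fall back. This is a minimal sketch, not part of the original report. The function name detectWithFallback() and the ISO-8859-15 default are assumptions for illustration. UTF-16 and UTF-32 are left out on purpose, because telling them apart from single-byte text needs more than a validity check (for example a byte order mark).

```php
<?php
// Hypothetical helper: the name and the fallback choice are illustrative assumptions.
function detectWithFallback($bytes, $fallback = 'ISO-8859-15')
{
    // ASCII is the most restrictive candidate: every byte must be below 0x80.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';
    }
    // UTF-8 has strict multi-byte rules, so legacy single-byte text rarely validates.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // ISO 8859-* accept any byte sequence, so this is a guess, not a detection.
    return $fallback;
}

echo detectWithFallback("abc"), "\n";               // ASCII
echo detectWithFallback("\xC3\xA9t\xC3\xA9"), "\n"; // UTF-8
echo detectWithFallback("\xE9t\xE9"), "\n";         // ISO-8859-15 (fallback)
```

The fallback is returned unchecked because, as explained above, a permissive single-byte encoding would pass any check anyway.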
 [2014-04-23 13:37 UTC] scy-bugs-php at scy dot name
In case anybody is curious about the comment above that says "Windows-1251 and -1252 will only succeed if the entire string consists of high-byte characters in a certain range": I wanted to find out what these "legal" characters are, and it's 0x80 to 0x9f. Anything outside that (even a plain "A") in your string and it will _not_ be identified as Windows-1252, even though it very much might be. Yes, of course, this is completely useless, as probably every Windows-1252 string will contain valid ASCII as well.

See https://github.com/php/php-src/blob/94e15ff3877f842e5eb5c89e3aeab214fb4a3a33/ext/mbstring/libmbfl/filters/mbfilter_cp1252.c#L138 for the source code that does this.
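
Since the detector described here is of little help for Windows-1252, one rudimentary alternative is the kind of byte analysis suggested earlier in this thread. Bytes 0x80 to 0x9F are C1 control characters in the ISO 8859 family but mostly printable characters (such as the euro sign and curly quotes) in Windows-1252. A minimal sketch, assuming the input is already known not to be UTF-8; the function name guessLegacyEncoding() is hypothetical:

```php
<?php
// Hypothetical heuristic; assumes $bytes is single-byte legacy text, not UTF-8.
function guessLegacyEncoding($bytes, $isoFallback = 'ISO-8859-15')
{
    // Bytes 0x80-0x9F are control codes in ISO 8859-*, which ordinary text
    // almost never contains, so their presence points to Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return 'Windows-1252';
    }
    return $isoFallback;
}

echo guessLegacyEncoding("caf\xE9"), "\n";        // ISO-8859-15
echo guessLegacyEncoding("\x93quoted\x94"), "\n"; // Windows-1252
```

This cannot tell Windows-1252 from Windows-1251, which also assigns printable characters to that range, so it only helps when the likely language of the text is already known.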
 [2015-07-16 17:55 UTC] cmb@php.net
-Status: Open
+Status: Not a bug
-Package: *General Issues
+Package: mbstring related
-Assigned To:
+Assigned To: cmb
 [2015-07-16 17:55 UTC] cmb@php.net
As requinix has explained, this is not a bug.
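
For readers arriving with the same problem, the practical upshot is to decide the source encoding yourself and to verify the result with a validity check rather than a second detection pass. A minimal sketch, assuming any non-UTF-8 input is ISO-8859-15 (the guess made earlier in the thread); toUtf8() is a hypothetical name, not code from the report:

```php
<?php
// Hypothetical helper; the ISO-8859-15 default is an assumption about the input.
function toUtf8($bytes, $assumedSource = 'ISO-8859-15')
{
    // Already valid UTF-8 (this includes plain ASCII): nothing to convert.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    $converted = iconv($assumedSource, 'UTF-8', $bytes);
    // iconv() returns false on failure; also confirm the output really is UTF-8.
    if ($converted === false || !mb_check_encoding($converted, 'UTF-8')) {
        return false;
    }
    return $converted;
}

var_dump(toUtf8("\xE9t\xE9") === "\xC3\xA9t\xC3\xA9");           // bool(true)
var_dump(toUtf8("d\xC3\xA9j\xC3\xA0") === "d\xC3\xA9j\xC3\xA0"); // bool(true), left untouched
```

Checking the output with mb_check_encoding() sidesteps the original confusion: a correctly converted UTF-8 string still "matches" ISO-8859-1 if that encoding is tested first.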