Bug #28654 Possible bug in utf8_encode (bit operations)
Submitted: 2004-06-06 22:55 UTC Modified: 2004-09-21 08:53 UTC
From: krausbn@php.net Assigned:
Status: Not a bug Package: *Languages/Translation
PHP Version: 4.3.4 OS: WinXP
Private report: No CVE-ID: None
 [2004-06-06 22:55 UTC] krausbn@php.net
Description:
------------
Hi!

I'm currently developing a nice script that generates OpenOffice SXW files by filling content.xml (which is UTF-8 encoded) with database content. While doing this I found that utf8_encode('?') (charcode 147) returns '“'. But when I checked the whole result in OpenOffice, '?' is displayed as a square (character unknown?!). So I ran some tests with UTF-8 conversion (even with the mb_* functions) and noticed that the characters between 128 and 160 returned by utf8_encode() don't seem to match the standard. As mentioned above, '?' is returned as '“' but should be '’' (which is what you get when using UltraEdit for the conversion).

Can anyone give me an explanation here?

I'm not familiar with this UTF-8 / bit-conversion stuff, but I don't think PHP does what it's supposed to do here. As a first workaround I simply coded a custom_utf8_encode() that uses its own character map to override this misbehaviour (see below). Can someone help me out with this strange bug?!

Regards
Bjoern Kraus


function custom_utf8_encode($str)
{
    // Windows-1252 (CP1252) characters for bytes 128-159. Assumes this file
    // is saved as UTF-8, so the literals below are UTF-8 byte sequences.
    // Bytes 129, 141, 143, 144 and 157 are unassigned in CP1252.
    $chrMap = array(128 => '€',  129 => '',  130 => '‚', 131 => 'ƒ',
                    132 => '„', 133 => '…', 134 => '†', 135 => '‡',
                    136 => 'ˆ',  137 => '‰', 138 => 'Š',  139 => '‹',
                    140 => 'Œ',  141 => '',  142 => 'Ž',  143 => '',
                    144 => '',  145 => '‘', 146 => '’', 147 => '“',
                    148 => '”', 149 => '•', 150 => '–', 151 => '—',
                    152 => '˜',  153 => '™', 154 => 'š',  155 => '›',
                    156 => 'œ',  157 => '',  158 => 'ž',  159 => 'Ÿ');

    $newStr = '';
    $len = strlen($str);

    for ($i = 0; $i < $len; $i++) {
        $chrVal = ord($str[$i]);
        if ($chrVal > 127 && $chrVal < 160) {
            // Bytes 128-159: take the character from the map above.
            $newStr .= $chrMap[$chrVal];
        } else {
            // Everything else is left to utf8_encode().
            $newStr .= utf8_encode($str[$i]);
        }
    }

    return $newStr;
}



 [2004-06-08 09:38 UTC] derick@php.net
utf8_encode only deals with iso-8859-1, which does not define characters in the range from 128 to 160. Though it should probably just replace those characters with a question mark, as that's how invalid characters are usually converted.
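
To illustrate the point above, here is a minimal sketch of an ISO-8859-1 to UTF-8 conversion written out with explicit bit operations. The function name latin1_to_utf8_sketch is hypothetical and this is not PHP's actual source; it only assumes that each input byte is treated as the Unicode code point with the same number, which is why byte 147 ends up as the control character U+0093 rather than a typographic quote.

```php
<?php
// Hypothetical helper for illustration only (not PHP's implementation).
// Each ISO-8859-1 byte is encoded as the UTF-8 form of the code point
// that has the same numeric value as the byte.
function latin1_to_utf8_sketch(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // ASCII bytes are copied unchanged.
            $out .= chr($b);
        } else {
            // Code points U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

// Byte 147 (0x93) becomes C2 93, i.e. U+0093, a C1 control character.
echo bin2hex(latin1_to_utf8_sketch("\x93")), "\n"; // c293
```

In Windows code page 1252, byte 147 is the left double quotation mark U+201C, whose UTF-8 form is E2 80 9C, so the two results differ.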
 [2004-06-10 00:11 UTC] krausbn@php.net
Hm, what ISO standard am I using (German, Win32) when I copy and paste Word text into a textarea and post it to a PHP script?
Is it possible to solve my problem by converting my character encoding to iso-8859-1 with the mb-functions?
 [2004-06-14 21:00 UTC] moriyoshi@php.net
Looks like you are trying to do the conversion between code page 1252 and UTF-8.

http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

Leaving mbstring aside, most iconv() implementations support CP1252 (a.k.a. IBM1252).

HTH
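
A minimal sketch of the iconv() route suggested above, assuming the input really is code page 1252 and that the local iconv implementation knows the charset name 'CP1252' (most do, but this depends on the platform). The wrapper name cp1252_to_utf8_sketch is hypothetical.

```php
<?php
// Hypothetical wrapper for illustration: converts a CP1252 byte string
// to UTF-8 with iconv(). Returns false if iconv() cannot convert the input.
function cp1252_to_utf8_sketch(string $bytes)
{
    return iconv('CP1252', 'UTF-8', $bytes);
}

// Byte 147 (0x93) is the left double quotation mark U+201C in CP1252.
echo bin2hex(cp1252_to_utf8_sketch("\x93")), "\n"; // e2809c
```

The result is used directly as UTF-8; it is not passed through utf8_encode() afterwards, since that would encode the already converted bytes a second time.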

 [2004-07-11 21:34 UTC] sniper@php.net
Moriyoshi: Was that last comment a statement that this is a bug in PHP, or what? Is this a verified bug? Can you fix it if it is? (If it's not a bug -> bogus.)

 [2004-07-12 18:51 UTC] moriyoshi@php.net
Most likely not a bug in PHP. The reporter had better check out the things I pointed out.

 [2004-07-14 10:25 UTC] krausbn@php.net
Sorry, but even the iconv() function doesn't do the job. Maybe someone else (a German guy?) can run some tests to confirm my problem.
 [2004-09-21 08:53 UTC] momo@php.net
The first bug you reported is not a bug. If you have problems with iconv, file a new bug report. (Feeding utf8_encode with iconv output is not what moriyoshi suggested; he suggested that you don't use the utf8_encode function at all.)
 