Bug #41554 utf8_encode is missing characters
Submitted: 2007-06-01 00:57 UTC Modified: 2007-06-05 19:09 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:2 of 2 (100.0%)
Same Version:2 (100.0%)
Same OS:2 (100.0%)
From: victorepand at gmail dot com Assigned:
Status: Not a bug Package: Strings related
PHP Version: 4.4.7 OS: Linux
Private report: No CVE-ID: None
 [2007-06-01 00:57 UTC] victorepand at gmail dot com
Description:
------------
I have used the function utf8_encode to encode iso-8859-1 pages into UTF-8 and displayed them on my site, but strange and funny characters are appearing such as "?" and "?".

It turns out that the iso-8859-1 page contains characters such as these:
?,?,?,?,?,?,?,?
These characters display fine on my browser from the iso-8859-1 page, but when I use the utf8_encode function and display it on my utf-8 page, the result is garbled.

So I have found the only solution is to manually convert all of the characters above before using the utf8_encode function and that solves the problem crudely, but it is not a perfect solution. What if I have missed any characters? Isn't there a cleaner method, a PHP function, that will do all this conversion without worry and without missing any characters?



Reproduce code:
---------------
Here is an example of an iso-8859-1 page that displays fine in my browser but contains characters such as the ?,?,?,?,?,?,?,? mentioned above:
http://www.jardenstore.com/product.aspx?bid=18&pid=1251


Expected result:
----------------
After using the utf8_encode function, I expected to see the page displaying correctly again on my UTF-8 page with these characters intact: ?,?,?,?,?,?,?,? 

Actual result:
--------------
Instead, the result was garbled like this:
‘,—,–,’,?,??™,??™,??,é,Ã?,™,?,?,è,Ž,?

 [2007-06-01 01:32 UTC] ezyang@php.net
My gut reaction to your problem is to mention that you've probably mixed up ISO 8859-1 and Windows-1252: the two are commonly confused for each other, the Windows encoding containing several more characters: ???????????????????????????

However, said behavior does not precisely match up with your predicament, as ? and ? are part of ISO 8859-1. Furthermore, the URL you supplied is already encoded in UTF-8. Perhaps you are double encoding?

Either way, this is not a problem with the documentation, except possibly the fact that the user notes on utf8_encode are waaaaay too long and some of the info needs to be integrated into the main docs.
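
The ISO 8859-1 versus Windows-1252 mix-up described in this comment can be seen at the byte level. The following is only a sketch: the sample bytes are made up, it assumes the mbstring extension is loaded, and it is written for PHP 7 or later rather than the PHP 4.4.7 of the report. It converts the same bytes twice, once declared as ISO-8859-1 (the input encoding utf8_encode assumes) and once declared as Windows-1252.

```php
<?php
// Hypothetical sample: 0x92, 0x93, 0x94 and 0x97 are curly quotes and an
// em dash in Windows-1252, but C1 control codes in ISO-8859-1.
$bytes = "\x93Hi\x94 \x97 it\x92s";

// Declared as ISO-8859-1: 0x92 becomes U+0092, a control character.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// Declared as Windows-1252: 0x92 becomes U+2019, a right single quote.
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // hex dump, since control codes do not print
echo $asCp1252, "\n";
```

If the input really is Windows-1252, naming that encoding explicitly avoids having to replace each character by hand before conversion.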
 [2007-06-04 17:27 UTC] tony2001@php.net
Thank you for this bug report. To properly diagnose the problem, we
need a short but complete example script to be able to reproduce
this bug ourselves. 

A proper reproducing script starts with <?php and ends with ?>,
is max. 10-20 lines long and does not require any external 
resources such as databases, etc. If the script requires a 
database to demonstrate the issue, please make sure it creates 
all necessary tables, stored procedures etc.

Please avoid embedding huge scripts into the report.


 [2007-06-04 22:43 UTC] victorepand at gmail dot com
Here are 2 short test scripts that demonstrate the problem:

<?php
$testhtml="<html>\n<head>\n<META http-equiv=Content-Type content=\"text/html; charset=UTF-8\">\n</head>\n<body>\nSpecial Characters: ?,?,?,?,?,?,?,?</body>\n</html>";
print $testhtml;
?>

The sample output is shown here:
http://www.vacuumfoodsealer.info/utftest2.php
Special Characters: &#65533;,&#65533;,&#65533;,&#65533;,&#65533;,&#65533;,&#65533;,&#65533;

The result is garbled, which is expected in this case, because the content-type of the page is UTF-8 and the characters have not been encoded as UTF-8.

However, the second test script:
<?php
$testhtml="<html>\n<head>\n<META http-equiv=Content-Type content=\"text/html; charset=UTF-8\">\n</head>\n<body>\nSpecial Characters: ?,?,?,?,?,?,?,?</body>\n</html>";
print utf8_encode($testhtml);
?>

Produces this output here:
http://www.vacuumfoodsealer.info/utftest.php
Special Characters: ?,&#146;,&#151;,&#147;,&#148;,?,&#153;,&#133;

This time the characters have been encoded into UTF-8. Since the content-type of the page is UTF-8 and the characters have been encoded into UTF-8, then why should they appear garbled? And if it is not a bug with utf8_encode, then what method would I use to correctly display these characters in UTF-8? I don't know of any function that will convert these characters!
 [2007-06-05 08:14 UTC] gwynne@php.net
The page you linked to as "an example of an iso-8859-1 page" appears to 
in fact be already encoded in utf-8. Treating utf-8 bytes as iso-8859-1 
and attempting the conversion will result in the "incorrect" output 
you're seeing. Make sure that the source you're giving utf8_encode() is 
in fact iso-8859-1 encoded.
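
The double encoding described here can be reproduced in a few lines. This is a sketch with a made-up one-character input, assuming the mbstring extension and PHP 7 or later. mb_convert_encoding from ISO-8859-1 stands in for utf8_encode, since both map each input byte to the code point with the same number.

```php
<?php
// "é" already encoded as UTF-8 is the two bytes C3 A9.
$alreadyUtf8 = "\xC3\xA9";

// Treating those two bytes as two ISO-8859-1 characters ("Ã" and "©")
// and encoding each one again yields four bytes of mojibake.
$doubleEncoded = mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($alreadyUtf8), "\n";   // c3a9
echo bin2hex($doubleEncoded), "\n"; // c383c2a9
```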
 [2007-06-05 19:04 UTC] victorepand at gmail dot com
Those characters are windows-1252 encoded because they were typed into Wordpad on a Windows operating system.

So now my question is, how can my script detect the coding of a variable if it is unknown? For example, if I use this function: mb_convert_encoding($testhtml,"UTF-8","auto"), I get an "Unable to detect character encoding" error. Here is an example of that:

<?php
$testhtml="<html>\n<head>\n<META http-equiv=Content-Type content=\"text/html; charset=UTF-8\">\n</head>\n<body>\nSpecial Characters: ?,?,?,?,?,?,?,?</body>\n</html>";
$testhtml=mb_convert_encoding($testhtml,"UTF-8","auto");
print $testhtml;
?>

Sample output:
http://www.vacuumfoodsealer.info/utftest4.php
Warning: mb_convert_encoding() [function.mb-convert-encoding]: Unable to detect character encoding in /home/vgevge/public_html/vacuumfoodsealer/utftest4.php on line 3
Special Characters: &#65533;,&#65533;,&#65533;,&#65533;,&#65533;,&#65533;,&#65533;,&#65533;
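
Single-byte encodings generally cannot be told apart by inspection, because the same bytes are usually valid in several of them. One common workaround, sketched here, is to test whether the string is already valid UTF-8 and otherwise fall back to one assumed legacy encoding. The helper name to_utf8 is hypothetical, the Windows-1252 fallback is an assumption that fits text typed on Windows, and the code targets PHP 7 or later with mbstring.

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already UTF-8 (or plain ASCII): leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

$fromWordpad = "it\x92s";     // Windows-1252 right single quote
$alreadyUtf8 = "it\u{2019}s"; // the same text, already UTF-8

echo to_utf8($fromWordpad), "\n";
echo to_utf8($alreadyUtf8), "\n";
```

Because the validity check runs first, calling the helper twice on the same string does not double encode it. The fallback is still a guess: input in some other legacy encoding would be converted without error but to the wrong characters.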
 [2007-06-05 19:09 UTC] tony2001@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.


 