Bug #66618 UTF-8 encoding error
Submitted: 2014-01-31 15:06 UTC Modified: 2021-08-04 11:07 UTC
From: francois dot gannaz at silecs dot info Assigned: cmb (profile)
Status: Closed Package: Website problem
PHP Version: Irrelevant OS:
Private report: No CVE-ID: None
 [2014-01-31 15:06 UTC] francois dot gannaz at silecs dot info
Description:
------------
The encoding of most pages is invalid. The most frequent error is that each underscore character in the left column (e.g. the list of functions) is followed by an invalid byte.

Browser behavior varies. Most silently ignore the invalid bytes, and some display a special character for each error.

The [W3C validator](http://validator.w3.org/) confirms the problem. Here is its answer when asked to validate "http://php.net/manual/en/function.array-merge.php":

"Sorry, I am unable to validate this document because on line 3400 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication. 

The error was: utf8 "\xE9" does not map to Unicode"
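
A minimal sketch of how such an offending byte could be located locally, assuming the page has been saved as raw bytes. The function name `find_invalid_utf8` and the sample data are hypothetical and not taken from the actual page:

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset, byte_value) for each invalid UTF-8 sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start                  # absolute offset of the bad sequence
            line = data.count(b"\n", 0, bad) + 1   # 1-based line number
            problems.append((line, bad, data[bad]))
            pos += err.end                         # resume after the invalid sequence
    return problems


# Made-up sample: an ISO-8859-1 "é" (0xE9) inside otherwise ASCII text.
sample = b"line one\ncaf\xe9 au lait\nok\n"
print(find_invalid_utf8(sample))  # [(2, 12, 233)]  (233 == 0xE9)
```

The reported line number plays the same role as the "line 3400" in the validator message above.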


 [2014-01-31 16:48 UTC] francois dot gannaz at silecs dot info
After some debugging, there are 2 distinct problems:

1. Some comments on PHP pages are badly encoded. They contain invalid characters, which prevent the W3C validator (or iconv) from parsing the page. There are such problems in the 420 KB of "http://php.net/manual/en/function.array-merge.php".

2. The problem with "zero-width spaces" that get printed is probably related to CSS, not to encoding bugs. At least with Opera 15, disabling the following line fixes the display:
body, input, textarea { 
    font-family: "Source Sans Pro", "Helvetica", "Arial", sans-serif;
}
 [2014-01-31 17:05 UTC] bjori@php.net
-Status: Open +Status: Feedback
 [2014-01-31 17:05 UTC] bjori@php.net
Your second problem sounds like: https://github.com/php/web-php/pull/32

As for the 1st one; I have deleted the note that had the invalid character point.

Can you recheck in about 60 minutes and see if it works?
 [2014-01-31 18:12 UTC] cmbecker69 at gmx dot de
| As for the 1st one; I have deleted the note that had the 
| invalid character point.

Actually, at least the comment by rafmavCHEZlibre_in_france is most
likely encoded as ISO-8859-1.  The offending character is an é, 
which is quite common in several languages.  There might be a lot 
more of these comments. Instead of deleting them, it might be 
worth transcoding them to UTF-8.
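
A rough sketch of what such transcoding could look like, under the assumption that any note which is not valid UTF-8 was written in ISO-8859-1. The helper name `to_utf8` is hypothetical, and the real encoding of a given note may differ (Windows-1252, for example):

```python
def to_utf8(note: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return the note as UTF-8 bytes, transcoding from `fallback` if needed."""
    try:
        note.decode("utf-8")
        return note  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Assumption: invalid UTF-8 means the note was stored in `fallback`.
        return note.decode(fallback).encode("utf-8")


print(to_utf8(b"caf\xe9"))      # b'caf\xc3\xa9'  (transcoded)
print(to_utf8(b"caf\xc3\xa9"))  # b'caf\xc3\xa9'  (unchanged)
```

Checking for valid UTF-8 first avoids double-encoding notes that are already correct.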
 [2014-02-03 09:14 UTC] francois dot gannaz at silecs dot info
-Status: Feedback +Status: Open
 [2014-02-03 09:14 UTC] francois dot gannaz at silecs dot info
Removing one comment did not change the encoding problem. Here is the direct link to the W3C HTML validator on one of the offending pages:
http://validator.w3.org/check?uri=http%3A%2F%2Fphp.net%2Fmanual%2Fen%2Ffunction.array-merge.php&charset=%28detect+automatically%29&doctype=Inline&group=0
 [2014-02-03 14:23 UTC] bjori@php.net
I can't imagine that a handful of Latin-1 encoded characters in the user-submitted data are causing your browser to show you weird data or broken navigation.
 [2014-02-03 14:47 UTC] cmbecker69 at gmx dot de
If a browser renders a page as UTF-8 and encounters byte sequences
that are not valid UTF-8, it substitutes the Unicode replacement
character U+FFFD[1] for them.

You can see that yourself if you visit the array_search man page[2],
and search for "greetz Udo". Just below this text is an "ANSI" 
encoded non-breaking space character (0xA0), which is displayed as
<?>.

[1] <http://www.fileformat.info/info/unicode/char/0fffd/index.htm>
[2] <http://php.net/manual/en/function.array-search.php>
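
The substitution can be reproduced outside a browser. The following is only an illustrative sketch: the bytes are a made-up stand-in for the note (only the "greetz Udo" text and the 0xA0 byte come from the description above), and `render_as_utf8` is a hypothetical helper that mimics lenient decoding:

```python
def render_as_utf8(raw: bytes) -> str:
    """Decode as UTF-8, replacing each invalid sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


sample = b"greetz Udo\xa0"  # a lone 0xA0 is not valid UTF-8

print(ascii(render_as_utf8(sample)))    # 'greetz Udo\ufffd'
print(ascii(sample.decode("cp1252")))   # 'greetz Udo\xa0'  (a real non-breaking space)
```

Decoded with the encoding the author presumably used, the same byte is an ordinary non-breaking space (U+00A0).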
 [2014-02-03 14:50 UTC] bjori@php.net
Thanks for the explanation, but I'm still missing the actual bug here.

Is this ticket for making sure all user contributed notes are UTF-8?
 [2014-02-03 14:59 UTC] cmbecker69 at gmx dot de
I have not opened the ticket, but it seems reasonable to convert
all old user contributed notes to UTF-8; the newer ones seem to be
anyway.
 [2014-02-03 15:00 UTC] francois dot gannaz at silecs dot info
As I wrote in the first note:
| The problem with "zero-width spaces" that get printed is probably related to CSS, not to encoding bugs.

Sorry, my bad: I thought that the encoding bugs that cause the W3C validator to reject the page also caused the display error. As I wrote, I later discovered that they were separate problems. See bug 66196 for the display error.

I still think that serving well-encoded pages would be better, even if it means deleting old comments (or applying iconv to them).
 [2021-08-04 11:07 UTC] cmb@php.net
-Status: Open +Status: Closed -Assigned To: +Assigned To: cmb
 [2021-08-04 11:07 UTC] cmb@php.net
The mentioned user notes have been removed in the meantime; there
might be others with non-UTF-8 encoding, but these can be handled
on a case-by-case basis.
 