Doc Bug #61451 Document that the default character set change may break existing code
Submitted: 2012-03-20 11:18 UTC
Modified: 2020-08-13 12:12 UTC
Avg. Score:4.6 ± 0.8
Reproduced:9 of 9 (100.0%)
Same Version:6 (66.7%)
Same OS:5 (55.6%)
From: perske at uni-muenster dot de
Assigned: cmb
Status: Closed
Package: Documentation problem
PHP Version: 5.4.0
OS: n/a
Private report: No
CVE-ID: None
 [2012-03-20 11:18 UTC] perske at uni-muenster dot de
From manual page:
The change of the default value of the $encoding parameter of html_entity_decode(), htmlentities() and htmlspecialchars() may break existing code. (It does for me!) Thus it should be mentioned on the "Backward Incompatible Changes" page. (There is a remark on the "Other changes" page, but that remark omits html_entity_decode().)

I propose adding this text as a list item:

"The default value of the $encoding parameter of html_entity_decode(), htmlspecialchars(), and htmlentities() has been changed from 'ISO-8859-1' to 'UTF-8'."

 [2014-07-23 21:15 UTC] ky dot patterson at adlinkr dot com
This remains a serious problem.
This change needs to be highlighted as an important BC break for anyone upgrading from a version before 5.4 to 5.4, 5.5, or 5.6.

I will explain why it is so severe, because most other descriptions I've seen of this problem somewhat miss the point:

Previously the default encoding was ISO-8859-1.
Now the default encoding is UTF-8.

There is no such thing as invalid ISO-8859-1 -- every byte value is accepted, so any byte stream looks like latin1.

There is very definitely such a thing as invalid UTF-8 -- for example, almost any ISO-8859-1 text that contains bytes outside 7-bit ASCII.

htmlspecialchars() and company will silently reject the input string if it is invalid for the given encoding.
They just return an empty string; they don't even issue a warning.

I didn't know this. No one who habitually uses ISO-8859 would know this, because it doesn't happen with ISO-8859.

So if you have a lot of code that looks like this:
echo htmlspecialchars($input);
and $input is 8-bit ISO-8859-1, then you are going to have a problem when you upgrade to 5.4 or 5.5: you will get no output.

Worse, if $input is usually 7-bit ASCII and is only occasionally 8-bit ISO-8859-1 -- a common situation in North America -- then you are in for a subtle but very severe problem.
It won't be obvious during your pre-upgrade tests, and you will end up having to do a mass find & replace on your codebase in a hurry after the upgrade.
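
A minimal sketch of the failure described above. The sample string and variable names are made up for illustration, and the flags and encoding are passed explicitly so the result does not depend on the PHP version or its configured defaults:

```php
<?php
// "café <b>" encoded as ISO-8859-1: the é is the single byte 0xE9,
// which does not form a valid UTF-8 sequence here.
$input = "caf\xE9 <b>";

// What 5.4 and 5.5 do when no encoding is given: treat the input as UTF-8.
$asUtf8 = htmlspecialchars($input, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// What 5.3 did by default: treat the input as ISO-8859-1.
$asLatin1 = htmlspecialchars($input, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

var_dump($asUtf8 === '');                        // whole string rejected, no warning
var_dump($asLatin1 === "caf\xE9 &lt;b&gt;");     // escaped as expected
```

A string containing only 7-bit ASCII gives the same result under both encodings, which is why the problem only shows up for the occasional 8-bit input.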

This isn't a hypothetical case. I just upgraded from 5.3 to 5.5 and got bitten by this in production code.
I did my due diligence prior to upgrading: I thoroughly reviewed all of the upgrading notes for 5.4 and 5.5, and even for 5.6 which was still in alpha.
I even reviewed the changelog -- which by the way makes no mention whatsoever of this change.
Even a single hint about this change would have saved me.

My suggestion, for the many others who will have to upgrade from 5.2 or 5.3 at some point, is:

1) Add a note to the Backwards Incompatible Changes page for 5.4
Something like this:

"The default encoding used by htmlentities() and htmlspecialchars() has changed from ISO-8859-1 to UTF-8. This means that calling htmlspecialchars($string) will return an empty string if $string is not valid UTF-8. If your input is not UTF-8, change your code to explicitly specify an encoding, e.g. htmlspecialchars($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1')"

2) Add a similar note to the Changed Functions page for 5.4

3) Add a note to the Changelog page for 5.4 (or for whichever version the change actually happened in)

4) Add a note to the Changed Functions page for 5.6
In 5.6 htmlentities() and htmlspecialchars() have been changed again, to honour the default_charset INI directive.
There should be a note about this, to reflect the fact that 5.6 can be made to behave like 5.3, whereas 5.4 and 5.5 cannot.
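
The two workarounds mentioned in the suggestions above can be sketched as follows. This is an illustration only: escape_latin1() is a hypothetical helper name, the sample string is made up, and option B assumes PHP 5.6 or later with no internal_encoding setting overriding default_charset.

```php
<?php
$input = "caf\xE9 <b>"; // ISO-8859-1 bytes, not valid UTF-8

// Option A (5.4 and later): a hypothetical wrapper that pins the encoding,
// so call sites no longer depend on the default.
function escape_latin1($string)
{
    return htmlspecialchars($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
}

// Option B (5.6 and later only): change the default that the functions
// fall back to when no encoding argument is passed.
ini_set('default_charset', 'ISO-8859-1');
$viaDefault = htmlspecialchars($input, ENT_COMPAT | ENT_HTML401);

var_dump(escape_latin1($input) === $viaDefault);
```

Note that default_charset also controls the charset PHP puts in the Content-Type header, so option B changes more than the escaping functions; option A keeps the change local to the code that needs it.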
 [2015-01-15 21:12 UTC] djonline at djonline dot ru
Please also fix this by taking the default charset from setlocale(), as other functions do.
Currently setlocale() does not change the default charset for these functions, which is a major incompatibility with valid existing 5.3 code.
 [2017-01-28 11:58 UTC]
-Status: Open +Status: Verified
 [2020-08-13 12:10 UTC]
-Assigned To: +Assigned To: cmb
 [2020-08-13 12:12 UTC]
-Status: Verified +Status: Closed
 [2020-08-13 12:12 UTC]
Automatic comment on behalf of cmb
Log: Fix #61451: Document that the default character set change may break existing code
 [2020-08-14 02:35 UTC]
Automatic comment on behalf of mumumu
Log: Fix #61451: Document that the default character set change may break existing code
 [2020-12-30 11:59 UTC]
Automatic comment on behalf of mumumu
Log: Fix #61451: Document that the default character set change may break existing code