Doc Bug #61451 Document that the default character set change may break existing code
Submitted: 2012-03-20 11:18 UTC
Modified: 2020-08-13 12:12 UTC
Votes:11
Avg. Score:4.6 ± 0.8
Reproduced:9 of 9 (100.0%)
Same Version:6 (66.7%)
Same OS:5 (55.6%)
From: perske at uni-muenster dot de
Assigned: cmb
Status: Closed
Package: Documentation problem
PHP Version: 5.4.0
OS: n/a
Private report: No
CVE-ID: None
 [2012-03-20 11:18 UTC] perske at uni-muenster dot de
Description:
------------
---
From manual page: http://www.php.net/migration54.incompatible
---
The change of the default value of the $encoding parameter of html_entity_decode(), htmlentities() and htmlspecialchars() may break existing code. (It does for me!) Thus it should be mentioned on the "Backward Incompatible Changes" page. (There is a remark on the "Other changes" page, but that remark omits html_entity_decode().)

I propose adding this text as a list item:

"The default value of the $encoding parameter of html_entity_decode(), htmlspecialchars(), and htmlentities() has been changed from 'ISO-8859-1' to 'UTF-8'."




Test script:
---------------
n/a

Expected result:
----------------
n/a

Actual result:
--------------
n/a

 [2014-07-23 21:15 UTC] ky dot patterson at adlinkr dot com
This remains a serious problem.
This change needs to be highlighted as an important BC break for anyone upgrading from a 5.x release older than 5.4 to 5.4, 5.5, or 5.6.

I will explain why it is so severe, because most other descriptions I've seen of this problem somewhat miss the point:

Previously the default encoding was ISO-8859-1.
Now the default encoding is UTF-8.

There is no such thing as invalid ISO-8859 -- any byte stream looks like latin1.

There is very definitely such a thing as invalid UTF-8 -- for example, any ISO-8859 that is not pure 7-bit ASCII.

htmlspecialchars() and company silently reject the input string if it is invalid for the given encoding (unless ENT_IGNORE or ENT_SUBSTITUTE is passed in the flags).
They just return an empty string; they don't even issue a warning.
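
To make the failure mode concrete, here is a minimal sketch. The sample string and variable names are illustrative only, and the flags are passed explicitly so that the result does not depend on the default flags of a particular PHP version.

```php
<?php
// "café" encoded as ISO-8859-1: the lone byte 0xE9 is not a valid UTF-8 sequence.
$latin1 = "caf\xE9";

// Same input, same flags; only the declared encoding differs.
$asUtf8   = htmlspecialchars($latin1, ENT_COMPAT, 'UTF-8');
$asLatin1 = htmlspecialchars($latin1, ENT_COMPAT, 'ISO-8859-1');

var_dump($asUtf8 === '');        // true: invalid UTF-8 yields an empty string
var_dump($asLatin1 === $latin1); // true: nothing to escape, bytes pass through
```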

I didn't know this. No one who habitually uses ISO-8859 would know this, because it doesn't happen with ISO-8859.

So if you have a lot of code that looks like this:
<?php
echo htmlspecialchars($input);
?>
and $input is 8-bit ISO-8859, then you are going to have a problem when you upgrade to 5.4 or 5.5: you will get no output at all.

Worse, if $input is usually 7-bit ASCII and is only occasionally 8-bit ISO-8859-1 -- a common situation in North America -- then you are in for a subtle but very severe problem.
It won't be obvious during your pre-upgrade tests, and you will end up having to do a mass find & replace on your codebase in a hurry after the upgrade.
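
One way to catch this before upgrading is to run representative input through a check for bytes that are not valid UTF-8, and to route escaping through a wrapper that states the encoding explicitly. The following is only a sketch: the function names is_not_utf8() and h() are hypothetical, and ISO-8859-1 as the wrapper's default is an assumption about the application's data.

```php
<?php
// Hypothetical helper: true if $s is NOT valid UTF-8.
// With the /u modifier, preg_match() fails on a subject that is not valid UTF-8.
function is_not_utf8($s)
{
    return preg_match('//u', $s) !== 1;
}

// Hypothetical wrapper: escape with the encoding stated explicitly,
// so the result does not depend on the version-specific default.
function h($s, $encoding = 'ISO-8859-1')
{
    return htmlspecialchars($s, ENT_COMPAT, $encoding);
}

// Pure ASCII, ISO-8859-1 "café", and UTF-8 "café".
$samples = array('plain ascii', "caf\xE9", "caf\xC3\xA9");
$flagged = array_map('is_not_utf8', $samples);

var_dump($flagged);             // false, true, false
var_dump(h("caf\xE9 & co"));    // "caf\xE9 &amp; co"
```

Pure 7-bit ASCII passes the check, which is exactly why test data without accented characters does not reveal the problem.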

This isn't a hypothetical case. I just upgraded from 5.3 to 5.5 and got bitten by this in production code.
I did my due diligence prior to upgrading: I thoroughly reviewed all of the upgrading notes for 5.4 and 5.5, and even for 5.6 which was still in alpha.
I even reviewed the changelog -- which by the way makes no mention whatsoever of this change.
Even a single hint about this change would have saved me.


My suggestion, for the many others who will have to upgrade from 5.2 or 5.3 at some point, is:

1) Add a note to the Backwards Incompatible Changes page for 5.4
http://ca2.php.net/manual/en/migration54.incompatible.php
Something like this:

"The default encoding used by htmlentities() and htmlspecialchars() has changed from ISO-8859-1 to UTF-8. This means that calling htmlspecialchars($string) will return an empty string if $string is not valid UTF-8. If your input is not UTF-8, change your code to explicitly specify an encoding, e.g. htmlspecialchars($string, ENT_COMPAT, 'ISO-8859-1')"

2) Add a similar note to the Changed Functions page for 5.4
http://ca2.php.net/manual/en/migration54.parameters.php

3) Add a note to the Changelog page
http://php.net/ChangeLog-5.php#5.4.0 (or whichever version the change actually happened at)

4) Add a note to the Changed Functions page for 5.6
http://ca2.php.net/manual/en/migration56.changed-functions.php
In 5.6 htmlentities() and htmlspecialchars() were changed again, to honour the default_charset INI directive.
There should be a note about this, to reflect the fact that 5.6 can be made to work like 5.3, whereas 5.4 and 5.5 cannot.
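
As an illustration of point 4, the sketch below assumes PHP 5.6 or later, where the $encoding default follows the default_charset INI directive, and assumes that internal_encoding is not set to something else. The sample string is illustrative only.

```php
<?php
// Restore the pre-5.4 behaviour for calls that omit $encoding.
ini_set('default_charset', 'ISO-8859-1');

// ISO-8859-1 "café" followed by markup that needs escaping.
$latin1  = "caf\xE9 <b>";
$escaped = htmlspecialchars($latin1);

// The 8-bit byte survives and the angle brackets are escaped.
var_dump($escaped === "caf\xE9 &lt;b&gt;"); // true
```

Note that default_charset also affects the charset PHP sends in the Content-Type header, so changing it is a wider decision than passing $encoding explicitly to each call.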
 [2015-01-15 21:12 UTC] djonline at djonline dot ru
Please also fix this bug by taking the default charset from setlocale(), as other functions do.
Currently setlocale() does not change the default charset for these functions, so this is a major incompatibility with valid existing 5.3 code.
 [2017-01-28 11:58 UTC] cmb@php.net
-Status: Open
+Status: Verified
 [2020-08-13 12:10 UTC] cmb@php.net
-Assigned To:
+Assigned To: cmb
 [2020-08-13 12:12 UTC] cmb@php.net
-Status: Verified
+Status: Closed
 [2020-08-13 12:12 UTC] phpdocbot@php.net
Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=doc/en.git;a=commit;h=5ce8426acc7b181fce8243cf8dc6d60c4ad88fd9
Log: Fix #61451: Document that the default character set change may break existing code
 [2020-08-14 02:35 UTC] phpdocbot@php.net
Automatic comment on behalf of mumumu
Revision: http://git.php.net/?p=doc/ja.git;a=commit;h=2b596f4cd4b9dba2cc56898f57ae65e2196c4c8a
Log: Fix #61451: Document that the default character set change may break existing code
 [2020-12-30 11:59 UTC] nikic@php.net
Automatic comment on behalf of mumumu
Revision: http://git.php.net/?p=doc/ja.git;a=commit;h=34895184fb8b9c39bbe048f34b71e03b3361d9e7
Log: Fix #61451: Document that the default character set change may break existing code
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Wed Jan 15 21:01:29 2025 UTC