Bug #61354 htmlentities and htmlspecialchars doesn't respect the default_charset
Submitted: 2012-03-12 03:03 UTC Modified: 2013-01-05 15:17 UTC
From: hufeng1987 at gmail dot com Assigned:
Status: Not a bug Package: Strings related
PHP Version: 5.4.0 OS: Linux/Windows/
Private report: No CVE-ID: None
 [2012-03-12 03:03 UTC] hufeng1987 at gmail dot com
Description:
------------
I am using PHP 5.4 and have run into trouble with htmlspecialchars and htmlentities.

The PHP 5.4 default charset is UTF-8.

I thought htmlspecialchars and htmlentities might use UTF-8 as the default encoding,

but even after I configured default_charset in my php.ini, htmlspecialchars and htmlentities stubbornly keep using UTF-8.

This is a bad experience. My project is fairly big and htmlspecialchars is used everywhere, with almost 3 million calls.

I have no realistic way to specify the encoding by hand.

Adding the encoding to each call of htmlspecialchars and htmlentities is not feasible; it is a huge change for me.

As an alternative, why not let htmlspecialchars use the encoding from the php.ini settings?

Wouldn't that be a better, more user-friendly way?

Sorry for my bad English.

Test script:
---------------
<?php
$string = '<pre><p>我是测试</p></pre>';

echo htmlspecialchars($string);
echo htmlspecialchars($string, NULL, 'GB2312');

Expected result:
----------------
htmlspecialchars should use the charset defined by default_charset in php.ini.


 [2012-03-12 05:36 UTC] laruence@php.net
-Summary: htmlentities and htmlspecialchars do not working with default_charset ini set
+Summary: htmlentities and htmlspecialchars doesn't respect the default_charset
-Status: Open
+Status: Verified
 [2012-03-12 05:47 UTC] rasmus@php.net
There is some confusion around this point. The default_charset in your php.ini 
file is meant to be the output encoding. What you specify here is what ends up 
in the HTTP Content-type response header. You should be able to change that 
without messing up your internal runtime encoding, which is why setting it does 
not automatically change the internal encoding used by 
htmlspecialchars/htmlentities. You can force them to look at it by setting the 3rd 
argument (the encoding) of the htmlspecialchars() call to "" (an empty string). 
This is documented on the http://php.net/htmlspecialchars page. But, like I 
mentioned, you should be able to change your output encoding separately from 
your internal runtime encoding, so we don't suggest doing this. The safest 
approach is to explicitly set your encoding on your htmlspecialchars() calls. 
There are times when you get data from sources that have different encodings, so two 
htmlspecialchars() calls in the same app may need to use different encodings.
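
A minimal sketch of the "explicit encoding per call" approach, assuming one input arrives as GB2312 bytes and another as UTF-8 bytes. The variable names and sample data are hypothetical and only illustrate that each call names the encoding of its own data:

```php
<?php
// Hypothetical input from a legacy source: "<p>" + two GB2312 characters + "</p>".
$from_gb = "<p>\xce\xd2\xca\xc7</p>";

// Hypothetical input from another source: "<p>" + one UTF-8 character + "</p>".
$from_utf8 = "<p>\xe6\x88\x91</p>";

// Each call states the encoding of the data it is escaping.
$escaped_gb   = htmlspecialchars($from_gb, ENT_COMPAT, 'GB2312');
$escaped_utf8 = htmlspecialchars($from_utf8, ENT_COMPAT, 'UTF-8');

// The markup characters are escaped; the multibyte characters pass through intact.
echo bin2hex($escaped_gb), "\n";
echo bin2hex($escaped_utf8), "\n";
```

Printing the results with bin2hex() avoids depending on the terminal's own encoding.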
 [2012-03-12 05:47 UTC] rasmus@php.net
-Status: Verified
+Status: Not a bug
 [2012-03-12 05:56 UTC] hufeng1987 at gmail dot com
If this is not a bug, why does this change break our old project?


In PHP versions before 5.4 we could call htmlspecialchars simply as:

htmlspecialchars($string);

and this call would not break the string.

But now, under PHP 5.4, the default encoding has changed to UTF-8, which may break old code.

It is impossible to rewrite the old code to add an explicit charset everywhere.
 [2012-03-12 06:04 UTC] rasmus@php.net
What do you mean it is impossible to rewrite old code? In previous versions 
htmlspecialchars() didn't respect the default_charset ini setting either. It only 
looks at that setting if you pass an empty string as the encoding. The change in 
PHP 5.4 was simply to switch from ISO-8859-1 to UTF8 when you do not specify a 
charset.
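
The effect of that switch can be reproduced on any version by passing the two defaults explicitly. A small sketch, assuming the input is the GB2312 byte string from the test script in this report:

```php
<?php
// GB2312 bytes of the report's test string, wrapped in markup.
$string = "<pre><p>\xce\xd2\xca\xc7\xb2\xe2\xca\xd4</p></pre>";

// Pre-5.4 default: every byte is a valid ISO-8859-1 character,
// so the multibyte data passes through unchanged.
$as_latin1 = htmlspecialchars($string, ENT_COMPAT, 'ISO-8859-1');

// 5.4 default: the bytes are not a valid UTF-8 sequence,
// so the function returns an empty string.
$as_utf8 = htmlspecialchars($string, ENT_COMPAT, 'UTF-8');

var_dump(strlen($as_latin1) > 0, $as_utf8 === '');
```

This is why callers who never passed an encoding see empty output after upgrading: the input is rejected as invalid rather than passed through.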
 [2012-03-12 06:05 UTC] hufeng1987 at gmail dot com
Maybe you are right, and PHP 5.4 should have UTF-8 as the default encoding.


But in a production environment this will cause more accidents.


Why can't PHP handle default_charset sensibly? That would free us from recoding.
 [2012-03-12 06:12 UTC] hufeng1987 at gmail dot com
When your project uses GB2312 as its default charset and you upgrade to PHP 5.4, you will find that htmlspecialchars no longer works as before.

If you want it to work correctly, you have to replace the following code:

old code:

htmlspecialchars($string);

new code:

htmlspecialchars($string, NULL, 'GB2312');

Recoding the full project is a huge amount of work,

especially when the project is old.
 [2012-03-12 18:27 UTC] tokul at users dot sourceforge dot net
Two small comments.

Could you write your Chinese symbols in hex notation? That way they are friendlier to pages written in another charset.

Your test code is
-----
<?php
$string = "<pre><p>\xce\xd2\xca\xc7\xb2\xe2\xca\xd4</p></pre>";

echo var_dump(htmlspecialchars($string));
echo var_dump(htmlspecialchars($string, NULL, 'GB2312'));
-----
Expected result - both var_dumps should be the same.

> htmlspecialchars should using charset defined by php.ini default_charset.

htmlspecialchars() should not use charset defined in PHP configuration. It should use iso-8859-1 for backwards compatibility reasons.
 [2012-03-12 19:29 UTC] tokul at users dot sourceforge dot net
> if you want them working correctly, you should replace following code 
> with new:
> old code:
> 
> htmlspecialchars($string);
> 
> new code:
>
> htmlspecialchars($string, NULL, 'GB2312');


htmlspecialchars($string, ENT_COMPAT, 'GB2312');

Default is to sanitize double quotes.
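
To illustrate the point about flags: a NULL flags argument is coerced to 0, which is the value of ENT_NOQUOTES, so double quotes are left unescaped. A small sketch (the sample string is made up):

```php
<?php
$s = 'say "hi" & <go>';

// Flags value 0 (what NULL becomes): quotes are left alone.
$no_quotes = htmlspecialchars($s, ENT_NOQUOTES, 'ISO-8859-1');

// ENT_COMPAT, the documented default: double quotes are escaped too.
$compat = htmlspecialchars($s, ENT_COMPAT, 'ISO-8859-1');

echo $no_quotes, "\n", $compat, "\n";
```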
 [2012-04-11 15:37 UTC] moonwalker at hotbox dot ru
Actually, it's a bug. Or at least a lack of customizability.
You're forcing thousands of PHP developers to move to UTF-8 (and urgently patch their legacy code) and don't give them a choice of default encoding in certain cases.
It would be much better not to force determine_charset() in ext/standard/html.c to return a hardcoded cs_utf_8 as the default encoding (WTF?) but to use, for example, the default_charset option value. At least it WOULD BE CONFIGURABLE without needing to rewrite all existing htmlspecialchars() / htmlspecialchars_decode() / html_entity_decode() calls.
Holy crap, you've just wasted so much of our time with that.
 [2012-05-19 16:46 UTC] wxiaoguang at gmail dot com
I consider this as a bug, too.

My old code using charsets other than utf-8/ISO-8859-1 is totally 
broken by php5.4's new htmlspecialchars.

in php5.3: ISO-8859-1 doesn't break any charset.
in php5.4: gbk/gb2312 characters are broken and I get empty 
strings after htmlspecialchars.

It's impossible to find all htmlspecialchars and add the 'utf-8' 
parameter to them in old projects. 

As a result, I can not upgrade to php5.4
 [2012-08-08 11:30 UTC] aheckmann at m-s dot de
We also have the problem with broken php code in 5.4.

It is really a huge amount of work, to switch old projects. 
We scanned our source files and found over 25.000 lines with htmlspecialchars(), not only written by us, also in many 3rd party libraries.

So we also can not switch these projects to php 5.4.

A way to set the default encoding back to ISO-8859-1 via php.ini/ini_set() would be great.
 [2012-08-08 12:16 UTC] giodev at panozzo dot it
Yes, this is a HUGE problem for us too. We migrated a single server with a single homemade application and lost 3-4 hours fixing all htmlspecialchars() and htmlentities() calls to force the encoding back to ISO-8859-1.

And really, under these conditions we will never migrate other apps/servers to PHP 5.4. Too much work to be done not only by us, but also by external contractors and customer of hosting services.

It's a big cost both for us and for our customers.
 [2012-08-23 06:31 UTC] hufeng1987 at gmail dot com
PHP 5.4.6 has been released, but this problem still exists.
Do you really care about PHP developers?
 [2012-08-27 12:54 UTC] goodwaiter at gmail dot com
It really is a bug.
But we can fix it in an easy way (I use Windows):
just change your PHP build from "Non Thread Safe" to "Thread Safe"
and everything will be OK.
 [2012-08-27 13:17 UTC] goodwaiter at gmail dot com
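Although switching to the non-thread-safe build works around the problem, there is no 
guarantee that a future PHP release won't treat the non-thread-safe build's correct 
behaviour as a bug and "fix" it, so we need PHP to recognise that this really is a bug.
Stated plainly, the bug is simple:
the default encoding and mb-internal-encoding settings in php.ini have no effect on 
htmlspecialchars and htmlentities; these two functions stubbornly use their own UTF-8 
and think they know best.
Put in more popular terms:
htmlspecialchars and htmlentities go their own way with their own encoding, which breaks 
overall consistency. PHP surely cannot tolerate this, and the bug must be eliminated.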
 [2012-08-27 16:24 UTC] goodwaiter at gmail dot com
My post above is wrong: changing from "Non Thread Safe" to "Thread Safe" does not fix 
it.

The right fix:
people affected can switch to Windows + IIS + ISAPI module + PHP, or Windows + 
Apache + FastCGI/ISAPI module + PHP, to avoid this bug.

I just tested, and only Windows + IIS + FastCGI + PHP shows the bug.
With the ISAPI module, or with Windows + Apache + FastCGI + PHP, it works fine.

My tests suggest the bug may be caused by IIS FastCGI.
In this setup, Zend Multibyte Support is always "provided by mbstring"; even if 
I change zend.multibyte to off or on, it still says "provided by mbstring".
phpinfo() shows the following:

mbstring
Multibyte Support  enabled  
Multibyte string engine  libmbfl  
HTTP input encoding translation  disabled  
libmbfl version  1.3.2  

mbstring extension makes use of "streamable kanji code filter and converter", 
which is distributed under the GNU Lesser General Public License version 2.1. 

Multibyte (japanese) regex support  enabled  
Multibyte regex (oniguruma) version  4.7.1  

and I used this code to test on Windows + IIS6 + FastCGI + PHP:

echo(mb_internal_encoding()); // shows ISO-8859-1, correct
$text = "我是测试"; // CP936 Chinese characters
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "CP936";
$ary[] = "UTF-8";
echo mb_detect_encoding($text, $ary); // shows CP936, correct
mb_detect_order($ary); // set the detection order
echo(htmlspecialchars($text)); // shows an empty string, wrong!
echo mb_detect_encoding(htmlspecialchars($text), $ary); // shows ASCII (not the UTF-8 people assume), wrong!
echo(mb_internal_encoding()); // still ISO-8859-1, unchanged, correct

This test shows that everything behaves correctly except htmlspecialchars(). 
Where does the "ASCII" come from? Maybe IIS6 + FastCGI triggers this, but it is really 
a PHP bug.
 [2012-08-27 16:37 UTC] goodwaiter at gmail dot com
Another way to fix it in code:
use htmlspecialchars($text,NULL,"")
There is no need to put "utf8", "cp936", or anything else inside the "". Just leave it empty and it will use 
the current page's encoding, as if this bug did not exist.
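
A hedged sketch of this workaround, assuming default_charset is set to the project's encoding (here GB2312, set with ini_set() only to keep the example self-contained). The exact detection order behind an empty-string encoding has varied between PHP versions, and flags are given as ENT_COMPAT rather than NULL:

```php
<?php
// Assumption: the project's charset is configured as default_charset.
ini_set('default_charset', 'GB2312');

// GB2312 bytes followed by a character that needs escaping.
$text = "\xce\xd2\xca\xc7<br>";

// An empty-string encoding asks PHP to detect the charset from its
// configuration instead of using the hardcoded default.
$escaped = htmlspecialchars($text, ENT_COMPAT, '');

echo bin2hex($escaped), "\n";
```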
 [2012-08-27 17:04 UTC] goodwaiter at gmail dot com
Since htmlspecialchars($text,NULL,""); works fine,

the PHP developers could fix this bug in an easy way:
just make an omitted encoding behave like an encoding of "", and everything will be 
OK.
 [2012-11-28 09:28 UTC] x dot bazilio at gmail dot com
This is a bug.
I just upgraded PHP and got empty strings on many projects.
I can't change the code in the CMS, because I am not a developer of the CMS. I use the CMS 
to develop web sites.
 [2012-12-28 17:36 UTC] rudibr at gmail dot com
This is a serious backward incompatibility (and not even listed as such).

I am also not able to upgrade to 5.4 because of this, and have advised all of the 
clients I provide server consulting to not to upgrade either.

No defaults of any kind should be changed arbitrarily, without notice and 
without the possibility of customization. It breaks code, and makes everyone affected very 
uneasy about any future releases.

Like everyone here, I hope this gets the serious attention it should have gotten 
already.
 [2013-01-05 03:53 UTC] leaflet at leafok dot com
I am facing the same problem.

After upgrading to PHP 5.4.10 in the production environment, all the GB2312-encoded data on the page became blank. This badly affected the whole site.

It is undoubtedly a backward-compatibility issue. I wish it could be resolved soon.
 [2013-01-05 03:55 UTC] hufeng1987 at gmail dot com
Please fix it as soon as possible.
 [2013-01-05 04:20 UTC] rasmus@php.net
You will need to update your code to be compatible with PHP 5.4 either by 
explicitly providing the charset, or by passing in "" to pick up the default one. 
Anything short of that is a security issue. Code that didn't do this in PHP 5.3 
is potentially insecure depending on which charset is being used, so no, nothing 
will be fixed here. We will not revert to 5.3 behaviour.
 [2013-01-05 04:26 UTC] hufeng1987 at gmail dot com
You took one step forward, but you are killing PHP programmers.

Do you know how much code needs to be rewritten and checked?

If your change breaks users' programs, it is your loss, not the users' loss.
 [2013-01-05 04:40 UTC] rasmus@php.net
Code that is currently likely to be insecure, yes. We only make changes like this 
when we are forced to for security reasons.
 [2013-01-05 09:53 UTC] x dot bazilio at gmail dot com
Please, fix it.
It is so simple to provide default parameters. Why should we have to pass NULL and an empty 
string? Where is the security problem in omitting NULL and the empty string if they 
were the default values of those parameters?
 [2013-01-05 09:54 UTC] hufeng1987 at gmail dot com
Passing NULL and an empty string is supposed to improve security? That makes no sense.
 [2013-01-05 15:17 UTC] rasmus@php.net
I have explained that a few times. We can't default it automatically because the  
encoding may not match the output encoding. Only the developer knows that. If we 
did that automatically it would break even more sites. The sites where the 
encodings differ need to set it explicitly.
 [2013-01-05 16:05 UTC] leaflet at leafok dot com
I understand your consideration. Maybe a global setting in php.ini, or a function that 
sets the encoding for the lifetime of the page, could be provided for these functions. 
Developers would be glad to handle this setting centrally in a header file included 
by each page.
 [2013-01-05 17:39 UTC] x dot bazilio at gmail dot com
OK. If I do not set a default time zone, I get an E_WARNING.
Let us set a default encoding for htmlspecialchars in the same way. It is not possible to persuade 
the developers of Drupal, Joomla, WordPress, Bitrix, etc., and the developers of modules 
for those CMSs, to rewrite their code.
I wrote to the tech support of Bitrix (a Russian CMS). They said that I must use PHP 
5.3.x. They are not going to rewrite their code.
 [2013-01-27 17:32 UTC] kstirn at gmail dot com
It will soon be a year since the release of PHP 5.4 and there still is no easy way (read: a global PHP setting) to overcome this huge backwards-incompatibility. 

PHP developers, I understand the security concerns, but please don't be so stubborn and give us an option to set a default setting without having to modify *all* legacy code to work with 5.4.

Your action (or lack thereof) is producing the opposite results of desired - instead of moving to PHP 5.4, thousands of servers (including several we own) will stay with 5.3.x even after end of life cycle in March 2013.

*Fact*
A simple global setting (an optional php.ini value) would solve the issue for thousands of users while addressing security issues by explicitly defining the default charset to be used by affected functions - all without having to rewrite existing code.

PHP team please do reconsider this and help everyone not using UTF-8 move to PHP 5.4.

Thank you!
 [2013-02-26 21:29 UTC] rudibr at gmail dot com
What about my third-party modules? Should I change their code as well? Do I now 
need to verify and manually alter the code of third-party modules every time I 
upgrade or install them?

If I'm using a component with protected code, do I need to go through their 
support staff and wait for a correction? What if they provide no reliable 
support or customization? Am I now being encouraged to hack and crack the 
source code just so I can fix this?

It is so easy, even redundant, and absolutely justifiable to create a new ini 
setting to control this behavior that I feel a little bit offended by the 
current attitude of the PHP developers over this issue.

I also feel a little bit offended because the guy who is responsible for this 
change EXPLICITLY stated that the change to a UTF-8 default has nothing to do 
with security. It just sounded like a "better default", according to the 
developer. Hardly a seriously thought-through consideration.

This is becoming quite a sad state of affairs. I guess I will have to consider 
moving on from php if it comes to that.
 [2013-05-19 13:10 UTC] minder at ufive dot unibe dot ch
For legacy projects in latin1 we substitute htmlspecialchars with the self-made 
function htmlXspecialchars according to these instructions: 
http://ufive.unibe.ch/?c=php54entitiesfix&q=&l=e
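
The linked instructions are not reproduced here. The following is only a sketch of the general wrapper idea, with hypothetical names (APP_CHARSET, html_escape) that are not part of PHP or of the linked fix: calls to htmlspecialchars() are replaced by a project function whose encoding default is the project's charset.

```php
<?php
// Hypothetical project-wide setting; not a PHP built-in.
const APP_CHARSET = 'ISO-8859-1';

/**
 * Hypothetical drop-in wrapper: same arguments as htmlspecialchars(),
 * but the encoding defaults to the project's charset.
 */
function html_escape($string, $flags = ENT_COMPAT, $encoding = APP_CHARSET, $double_encode = true)
{
    return htmlspecialchars($string, $flags, $encoding, $double_encode);
}

// "caf" + Latin-1 e-acute + markup; the accented byte survives escaping.
$out = html_escape("caf\xe9 <b>");
echo bin2hex($out), "\n";
```

Because the wrapper keeps the original parameter order, existing calls can in principle be converted with a search-and-replace of the function name, although third-party and encoded code remains out of reach.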
 [2013-05-20 18:14 UTC] kstirn at gmail dot com
@minder at ufive dot unibe dot ch

Yes, this can be done, but still means we would have to manually modify hundreds of legacy scripts on the server (many third party and many obfuscated/encoded)  to be able to upgrade to PHP 5.4. 

It would be really easy to fix with an ini setting and it would indeed make sense to have a setting for such a huge default change. I am disappointed that the PHP dev team has decided to completely ignore the issue.
 [2013-06-15 22:51 UTC] jbolder42 at yahoo dot com
I was wondering if someone could enlighten me by explaining why this:

htmlspecialchars($str, ENT_QUOTES, "ISO-8859-1");

... would be considered any more secure than something like this:

ini_set("html.default_charset", "ISO-8859-1");
htmlspecialchars($str, ENT_QUOTES);

Thank you!
 [2013-07-12 10:57 UTC] tototation at gmail dot com
Yes, I'm interested in understanding that too.
I recently upgraded my server, and ALL my code is unusable!
A search of the code found more than 470,000 occurrences of htmlentities or htmlspecialchars!
HOW ARE WE SUPPOSED TO CHANGE ALL THIS? IT'S IMPOSSIBLE!

Thanks, we must now stop all our services and websites.
Just because of a stupid thing.
 [2013-07-12 13:15 UTC] kstirn at gmail dot com
Instead of moving on to PHP 5.4 and PHP 5.5 thousands of servers will stay with legacy PHP 5.3 due to this single, easy to solve (ini setting) issue that the PHP team has decided to ignore.
 [2013-07-20 12:49 UTC] stemind at gmail dot com
Zend should be convinced. The Zend htmlspecialchars Initiative 
http://ufive.ch/tzhi/
 [2013-09-17 08:48 UTC] b83 at yandex dot ru
Moreover it will be impossible to upgrade to newer OS versions and use PHP versions from distro. Which is even more a security issue.

http://askubuntu.com/questions/306487/install-php-5-3-on-ubuntu-13-04
 [2013-10-03 08:08 UTC] support at playnext dot ru
For those still looking for a way around this headache, please consider:
1. http://php.net/manual/en/function.override-function.php
2. http://php.net/manual/ru/function.runkit-function-redefine.php

The idea: you override the built-in htmlspecialchars() function with your own customized variant that respects a non-UTF-8 default encoding. This small piece of code can then easily be inserted somewhere at the start of your project. There is no need to rewrite every htmlspecialchars() call globally.

I've spent several hours on both approaches. Variant 1 looks good, especially in combination with http://www.php.net/manual/en/function.rename-function.php, as it allows calling the original htmlspecialchars() with just the default arguments altered. The code could be as follows:

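<?php
// Keep the built-in reachable under a new name (rename_function() comes from the APD extension).
rename_function('htmlspecialchars', 'renamed_htmlspecialchars');
// Same signature as the built-in, but the encoding defaults to cp1251.
function overriden_htmlspecialchars($string, $flags=NULL, $encoding='cp1251', $double_encode=true) {
	// Fall back to the documented default flags when none are given.
	$flags = $flags ? $flags : (ENT_COMPAT|ENT_HTML401);
	return renamed_htmlspecialchars($string, $flags, $encoding, $double_encode);
}
// Route every htmlspecialchars() call through the wrapper above.
override_function('htmlspecialchars', '$string, $flags, $encoding, $double_encode', 'return overriden_htmlspecialchars($string, $flags, $encoding, $double_encode);');
?>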

Unfortunately this didn't work properly for me: my site managed to call the overridden function, but not every time I reloaded the pages. Moreover, other PHP sites on my Apache server crashed as they suddenly started complaining that htmlspecialchars() was not defined. I suppose I would have had to spend more time to make it thread/request/site/whatever-safe.

So I switched to runkit (variant 2). It worked for me, although even after trying runkit_function_rename()+runkit_function_add() I didn't manage to call the original htmlspecialchars() function again. So as a quick solution I decided to call htmlentities() instead:

<?php
function overriden_htmlspecialchars($string, $flags=NULL, $encoding='UTF-8', $double_encode=true) {
    $flags = $flags ? $flags : (ENT_COMPAT|ENT_HTML401);
    $encoding = $encoding ? $encoding : 'cp1251';
    //return renamed_htmlspecialchars($string, $flags, $encoding, $double_encode);
    return htmlentities($string, $flags, $encoding, $double_encode);
}
runkit_function_redefine('htmlspecialchars', '$string, $flags, $encoding, $double_encode', 'return overriden_htmlspecialchars($string, $flags, $encoding, $double_encode);'); 
?>

You may be able to implement a more powerful overriding function of your own.
Sorry if this post is not strictly bug-related. I support all the reports here: a small update to the default behaviour ruined our days...
Thank you.
 [2014-08-26 10:22 UTC] kstirn at gmail dot com
Has this issue been solved with upcoming PHP 5.6.0's default_charset setting?
http://php.net/manual/en/ini.core.php#ini.default-charset
 [2014-09-09 09:16 UTC] kstirn at gmail dot com
I tested it and in PHP 5.6.0 setting default_charset to "ISO-8859-1" in php.ini indeed seems to solve this issue.

But beware: PHP 5.6.0 will also send an HTTP header with the default_charset, so a different charset in HTML tags will be ignored by browsers unless you output your own header!
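
A minimal sketch of that setup, assuming PHP 5.6 or later and a project that uses ISO-8859-1. The charset is set with ini_set() here only to keep the example self-contained; in practice it would live in php.ini as described above:

```php
<?php
// Assumption: PHP >= 5.6, where an omitted encoding falls back to default_charset.
ini_set('default_charset', 'ISO-8859-1');

// Send a matching header explicitly so the browser and the escaping agree.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=ISO-8859-1');
}

// No encoding argument: the call picks up default_charset.
$escaped = htmlspecialchars("caf\xe9 <b>");
echo bin2hex($escaped), "\n";
```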
 
PHP Copyright © 2001-2018 The PHP Group
All rights reserved.
Last updated: Thu Jun 21 14:01:42 2018 UTC