PHP :: Request #47494 :: htmlspecialchars does not throw E_WARNING on multibyte problems |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Request #47494 htmlspecialchars does not throw E_WARNING on multibyte problems
Submitted: 2009-02-24 13:57 UTC Modified: 2012-10-19 06:46 UTC
From: philipp dot feigl at gmail dot com Assigned:
Status: Not a bug Package: Strings related
PHP Version: 5.2.8 OS: CentOS5
Private report: No CVE-ID: None
 [2009-02-24 13:57 UTC] philipp dot feigl at gmail dot com
When using htmlspecialchars with a invalid multibyte string and using UTF-8 as encoding, there are two possible outcomes based on the "display_errors" ini setting:

1. display_errors=1
=> empty string is returned

2. display_errors=0
=> E_WARNING is thrown

This is exactly what the code states. Can be viewed in
on line 1147

However this is VERY confusing as a developer point of view. As I have display_errors always set to "1" for debugging purposes, I never realized, one of our locale strings was corrupt, as it was just emptied out.

Now in the production environment, our error handler terminates the script because of the E_WARNING beeing thrown.

While both of the ways (empty string / error) are acceptable for me - because ofcourse the input string is invalid, it is very confusing to have different behaviors of PHP based on the display_errors setting.

Reproduce code:
echo 'a' . htmlspecialchars(substr(utf8_encode('a?'), 0, 2), ENT_QUOTES, 'UTF-8') . 'b';

Expected result:
Either 'ab'

However not both based on display_errors

Actual result:
display_errors=1 => 'ab'
display_errors=0 => E_WARNING


Add a Patch

Pull Requests

Add a Pull Request


AllCommentsChangesGit/SVN commitsRelated reports
 [2009-02-25 13:48 UTC]
It's intentional. If you disagree, please ask why it is like this (I once reverted that :) 
 [2009-11-20 20:24 UTC]
The idea is to return an error but not display it (i.e. log it or allow custom error handlers to process it). 

The reason for it is that, unfortunately, people run servers in production with display_errors=On, and php_escape_html_entities_ex can be triggered from all kinds of code that usually doesn't produce errors, which can reveal sensitive information on public sites.  So we chose to go after lesser of two evils and not generate the error in this context.

For debugging, I would suggest always logging errors and checking the error log, as some errors may be hard to spot in display anyway (especially true if your script produces something like JSON).
 [2010-06-14 13:30 UTC] trueleader at gmx dot de
Why the developer of the language create a workaround for bad configured servers and/or applications?

If a configuration variable tells that errors are shown on screen then I think all errors (dependent on reporting level) are shown - and not that they can be only logged if the configuration variable is turned off.
I think/hope this is not only my opinion.

We just lost some data, because we fill a JS confirm message on a HTML click event with a string from a PHP language variable. Nobody knows that we missed to utf8_encode because all developers use display_errors on and therefor no error is shown/logged for this problem
 [2011-03-10 17:37 UTC] pinkgothic at gmail dot com
I'm afraid this isn't just confusing, but actually punishes people who do it right by blindsiding them completely.


 * display_errors off
 * an Exception-throwing error handler

As far as I'm informed, this is good practise. (I acknowledge I may be misinformed.)

However, due to this behaviour, you suddenly get application crashes in production without that anyone really 
understands why the code snippet is suddenly a culprit. 'But we tested it with broken UTF-8, why are -we- just 
getting empty strings? And the documentation says that's what we should be expecting...'

> If a configuration variable tells that errors are shown on screen then I
> think all errors (dependent on reporting level) are shown - and not that
> they can be only logged if the configuration variable is turned off.
> I think/hope this is not only my opinion.

Yeah, you're not alone; but live and learn, I guess? :)

> For debugging, I would suggest always logging errors and checking the
> error log, as some errors may be hard to spot in display anyway
> (especially true if your script produces something like JSON).

Well, from my experience, people who deliberately turn display_errors on for development except their feedback in 
the browser window [*unless* they are writing for XHR], not in a log they may also be running in parallel.

This is especially true if you have a complex application that logs debug information from several services into one 
file in a compact format - so, unless you're LOOKING for an error, you won't spot anything. (Total sidenote, I 
honestly wish I could change the log format I have to suffer... but I'm stuck with it. Gargh.)

If you've been a good developer and read the manual on htmlspecialchars(), you're not going to expect an error. 
You're going to expect an empty string. Unfortunately currently, nothing in the documentation reveals that the 
function results in an E_WARNING, either pure-log-only when display_errors is on, or log and trigger_error()ed when 
display_errors is off.

By the time you find this closed php bug... well, if you're lucky, you've forced your developers to have a wrapper 
function you can now add a try-catch to - otherwise you can now do a project-wide search for every call of the 
function. [To be fair, I was fortunately lucky.]

Could this bug please get REOPENED as a documentation bug? I think adding this behaviour to the documentation would 
help a lot of people confused by it.
 [2011-03-10 18:05 UTC]
I say this is a logic error. Bugs #54109 and #52397 also mention the same 
behavior in two other spots of code. 
php_error_docref already handles display_errors configurations... I don't how 
this would be considered correct/intended 

Questionable logic:

if(!PG(display_errors)) { 
	php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid multibyte sequence 
in argument");
 [2011-05-02 14:48 UTC] example at example dot com
Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too! Me too!
 [2011-05-03 17:33 UTC]
This isn't a logic error. The idea is to prevent a user-triggered information 
leak by not showing this error to the user in case a production server is 
misconfigured and running with display_errors turned on.
 [2011-05-03 17:48 UTC] pinkgothic at gmail dot com
Could this bug please get REOPENED as a documentation bug
then? As already stated, the absence of the information in
the documentation can be crippling for people who do things
-right-. (Admittedly right now "htmlspecialchars" has my
comment on the subject, but that's hardly official...)

(Sidenote: You might also want to close Bug #54109 as bogus
for consistency.)
 [2011-06-01 18:36 UTC] larry at garfieldtech dot com
This bug should be reopened, not just documented.  Haven't we learned our lesson with magic_quotes and its ilk?  Designing PHP to try and save the user when the user does something stupid always backfires.  Always.  MySQL has the same problem, and it backfires there, too.

The current logic is simply backward.  "When display_errors is on, you get all errors except from this function.  When display_errors is off, you get no errors except from this one function."  There is no logical reason for that.

I'm working on a project that has been stalled for over a week while we try to figure out what's wrong with the character encoding configuration on our production server, only to realize that the data is (probably) bad but we didn't know it because of this bug.

This is a bug and should be fixed, not simply documented as dumb.

If a production server is misconfigured, that's not the job of the language to fix.  All that does is, as another commenter noted, punish those who configure their servers properly.  If anything, it is a security hole for people who DO configure their server properly by turning off display_errors, as then these strings would get echoed in production.  How is that helpful to anyone?
 [2011-09-27 22:43 UTC] rudd-o at rudd-o dot com
Reported to /r/lolphp here:

Do you guys realize there is an ENTIRE COMMUNITY of people devoted to the 
collective practice of FACEPALMING at PHP's fails?

 [2012-07-01 15:12 UTC] chris at cbsinteractive dot com
Happening our production servers, can replicate, PHP 5.3.10, Centos 5.6
 [2012-07-01 15:34 UTC]
-Type: Bug +Type: Feature/Change Request
 [2012-07-01 15:34 UTC]
This really isn't a bug. I do agree that the approach isn't ideal, but we 
shouldn't throw warnings on bad input here because htmlspecialchars() is 
explicitly designed to clean up bad input and it is run directly on user data 
most of the time. In order for someone to avoid this warning they would need to 
first call something like iconv('utf-8','utf-8') to clean up the input data and 
that doesn't make much sense since htmlspecialchars() essentially does that 
already. But, in order to help debugging there should be some way to see why an 
htmlspecialchars() call failed so a last_error() function similar to how it is 
handled for json decoding would make sense.
 [2012-08-30 19:01 UTC] another_disappointed_php_programmer at exam
This is very sad.

This is a bug, and it's sad that PHP core developers said that it's a feature and it won't be fixed. I'm disappointed.
 [2012-08-30 19:21 UTC]
@the disappointed user: PHP 5.4 no longer throws said warning (it was just confusing). Instead there are several new options for dealing with incorrect encoding. Of particular interest is ENT_SUBSTITUTE, which will replace invalid code unit sequences with the Unicode Replacement Character (instead of returning a rather unhelpful empty string). This way you can easily spot where the string is incorrectly encoded. Furthermore this option has the additional advantage of being more graceful (it just removed individual incorrectly encoded bytes, not the whole string).

Hope this helps you. More info in the docs:
 [2012-09-06 11:39 UTC] lzsiga at freemail dot c3 dot hu
Imho htmlspecialchars should not check for multi-byte validity at all, because it only deals with a few characters that are all in ASCII7, so it could safely ignore every byte between 0x80 and 0xFF. The third parameter could be simply ignored (as if it were 'ISO-8859-1')
 [2012-09-06 15:29 UTC]
You assume ASCII7 compatibility for all encodings which is a bad assumption.
 [2012-09-06 15:33 UTC]
Also note that many, if not most, apps use this as their only validity filter and 
if you output invalid UTF-8, for example, it can lead to security problems like 
the well-known IE 0xE0 XSS exploit. So at some point along the line you have to 
do a multi-byte check and it may as well be here since we need to do it anyway.
 [2012-09-13 17:07 UTC] lzsiga at freemail dot c3 dot hu
If the name of the function were 'check_for_multibyte_validity_and_htmlspecialchars' then you'd be right, but even then I'd lobby for a simple 'htmlspecialchars' function... Doing something (ie multibyte validity check) that the user (the PHP-programmer in this case) didn't specifically ask doesn't seem to me to be a good idea (see magic_quotes for another example).

PS: Of course I wouldn't complaining (or even know about the whole question) if the default value hadn't been changed to 'UTF-8' in 5.4.
 [2012-09-13 17:25 UTC]
By simple I assume you mean an htmlspecialchars() function that doesn't check the 
validity of the characters. The problem is that we have to do that. We can't 
encode characters without understanding which charset we are dealing with and we 
need to make sure that the character we are looking at is a valid one. The world 
has moved beyond 7-bit ASCII, sorry.
 [2012-09-13 18:53 UTC] lzsiga at freemail dot c3 dot hu
It would be a valid reason, if there were any plan to support utf16/32, as iso-8859-x and utf-8 are ASCII-compatible. But even then, the default value for the $encoding parameter still could be 'ascii(or compatible)'.

Or, like some other string operations, there could be a mb_htmlspecialchars function.
 [2012-10-19 06:11 UTC] user at dudmail dot com
Not showing with display_errors = 1 to avoid leaks on badly configured servers, while showing and thus leaking sensitive information with properly configured servers? This is lame.
 [2012-10-19 06:46 UTC]
user at dudmail dot com you seem confused. A properly configured server doesn't 
have display_errors on in production so I don't see how it is leaking in that 

And as was pointed out, this code has been revamped in 5.4 to give you a number 
of options of what to do with invalid chars which means there is no need for the 
warning anymore.
 [2012-10-19 07:40 UTC] lzsiga at freemail dot c3 dot hu
It is all moot point now: there was an incompatible change in 5.4, so everyone who want to develop portable code has to explicitly specify encoding='ISO-8859-1' (Even if they are using ISO-8859-2. Or windows-1250. Or CP852. Or UTF-8. Or it is not yet known at programming time. (Why, yes, there are programmers who develop routines/libraries that will be used by others.))
PHP Copyright © 2001-2018 The PHP Group
All rights reserved.
Last updated: Sat Aug 18 00:01:26 2018 UTC