Request #62341 htmlspecialchars() should return NULL on (encoding) failure
Submitted: 2012-06-17 10:06 UTC Modified: 2016-07-01 13:34 UTC
Votes:6
Avg. Score:4.7 ± 0.5
Reproduced:6 of 6 (100.0%)
Same Version:2 (33.3%)
Same OS:0 (0.0%)
From: bfanger at gmail dot com Assigned:
Status: Open Package: Strings related
PHP Version: 5.4.4 OS:
Private report: No CVE-ID: None
 [2012-06-17 10:06 UTC] bfanger at gmail dot com
Description:
------------
In PHP 5.4 the default encoding for htmlentities() was changed to 'UTF-8'.

When an ISO-8859-1 encoded string with a special character is passed to
htmlspecialchars(), it returns an empty string (invalid multibyte sequence).
This is the new intended (and more secure) behavior, and I agree, but...
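
A minimal sketch of the behavior described above, assuming PHP 5.4 or later. The sample string and variable names are made up for illustration; the flags are passed explicitly so the result does not depend on the version's default flags.

```php
<?php
// "café <b>" encoded as ISO-8859-1: the é is the single byte 0xE9,
// which is not a valid UTF-8 sequence when followed by a space.
$latin1 = "caf\xE9 <b>";

// Interpreted as UTF-8, the whole input is rejected.
$asUtf8 = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// Interpreted as ISO-8859-1, every byte is a valid character.
$asLatin1 = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');

var_dump($asUtf8 === '');                       // bool(true)
var_dump($asLatin1 === "caf\xE9 &lt;b&gt;");    // bool(true)
```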

The old default (ISO-8859-1) worked on both UTF-8, ISO-8859-1 and other ascii 
compatible encodings, which is reflected in the documentation:

"Calling htmlspecialchars() is sufficient if the encoding supports all characters 
in the input string (such as UTF-8 but also ISO-8859-1 on ISO-8859-1 only input). 
htmlentities() needs to be called only if the output encoding doesn't support all 
characters in the input string."

This is no longer the case, unless ENT_IGNORE is passed.

Solution:
Drop the paragraph from the documentation.

PS:
You might want to add a paragraph stating that incorrectly encoded text will cause
htmlspecialchars() to return an empty string.


 [2012-06-17 10:25 UTC] bfanger at gmail dot com
-Summary: htmlspecialchars() should work on ascii compatible encodings by default.
+Summary: Secure behavior htmlspecialchars() not reflected in the documentation
 [2012-06-17 10:25 UTC] bfanger at gmail dot com
Updated summary to "Secure behavior htmlspecialchars() not reflected in the 
documentation"

My initial change request "htmlspecialchars() should work on ascii compatible 
encodings by default" no longer applies. 
After some research I agree with the new behavior.
 [2012-06-17 17:47 UTC] bfanger at gmail dot com
Rereading the manpage more thoroughly, all the info is there. Another nice
resource is
http://nikic.github.com/2012/01/28/htmlspecialchars-improvements-in-PHP-5-4.html

I now disagree with the decision to return an empty string; with PHP's flexible
typing this should have been false or null.
PHP 5.4 no longer has the weird 'only errors when "display_errors" is off'
behavior, but sadly the chosen behaviour is to always silently suppress those
errors.
If throwing E_WARNING is too risky, an E_ENCODING error level would be a very
welcome addition.

ENT_IGNORE: Removes the invalid characters from the string instead of passing them through unchanged.
(My previous statement "unless ENT_IGNORE is passed." is therefore invalid)

Using strtr($text, array('<' => '&lt;', '>' => '&gt;', '&' => '&amp;')); is 35% 
slower than htmlspecialchars($text, ENT_NOQUOTES, 'ISO-8859-1') which has the 
same output.
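
The equivalence claimed above can be checked with a small sketch. The sample string is an assumption, and the snippet only compares output; it does not try to reproduce the 35% timing figure.

```php
<?php
// Sample input with all three escaped characters, a double quote,
// and a non-ASCII ISO-8859-1 byte (0xE9).
$text = "a < b && c > \"d\" caf\xE9";

// strtr() with an array does not rescan its own replacements,
// so the "&" in "&lt;" is not escaped a second time.
$viaStrtr = strtr($text, array('<' => '&lt;', '>' => '&gt;', '&' => '&amp;'));

// ENT_NOQUOTES leaves both quote styles untouched.
$viaHtml = htmlspecialchars($text, ENT_NOQUOTES, 'ISO-8859-1');

var_dump($viaStrtr === $viaHtml); // bool(true)
```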

The security risk applies only to multibyte encodings which always use 2 or more
bytes per character, like UTF-16 (but UTF-16 and UTF-32 aren't supported by
htmlspecialchars(); I'm not sure if any of the supported charsets is incompatible
with ASCII).

My framework uses UTF-8 95% of the time, but to prevent silent
truncation I'll have to add 'ISO-8859-1' as the encoding. It just feels wrong.

The default charset for htmlspecialchars should be "ASCII compatible"

"the encodings ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R 
are effectively equivalent"
no ifs, no buts.
 [2012-06-17 17:47 UTC] bfanger at gmail dot com
-Summary: Secure behavior htmlspecialchars() not reflected in the documentation
+Summary: htmlspecialchars() should work on ascii compatible encodings by default.
-Type: Documentation Problem
+Type: Feature/Change Request
-Package: Documentation problem
+Package: *Unicode Issues
 [2012-06-18 14:26 UTC] rasmus@php.net
EUC-JP is heavily used, supported by htmlspecialchars and it is not ASCII 
compatible.
 [2012-07-03 19:39 UTC] Bonefish26 at aol dot com
Everything is fine with htmlspecialchars() until someone copies data from their auto-formatted MS Word document and puts it in the update box. Setting the charset option seems to solve the problem.
 [2012-09-06 15:36 UTC] andreas dot rieber at t-online dot de
I also spotted that problem in an older ISO-8859-1 application. I could now convert the database to UTF-8 or change ca. 150 places in the old code.

Then I checked the problem a bit more closely: it is user input, so we don't really know what charset it is. We can only assume it is the charset we published the page in. That might be wrong, but with the new htmlspecialchars() behavior we would show nothing instead of partly wrong input.

I made some tests and it looks like the best option is to change my code (even for applications which use UTF-8) to:

htmlspecialchars( $text, 0, "iso-8859-1");

There must be a better way... To return nothing is not really good.
 [2012-09-06 15:43 UTC] rasmus@php.net
The problem with setting it to 8859-1 is that it lets everything through. If your 
page is actually in UTF-8 it means you are now vulnerable to 0xE0 XSS invalid 
UTF-8 style attacks. In PHP 5.4 we have addressed this by adding an 
ENT_SUBSTITUTE option that lets you substitute any invalid chars instead of 
returning an empty string.
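
A minimal sketch of the ENT_SUBSTITUTE option mentioned above, assuming PHP 5.4 or later and a made-up sample string. Invalid sequences are replaced with U+FFFD (REPLACEMENT CHARACTER, bytes EF BF BD in UTF-8) rather than blanking the whole result.

```php
<?php
// 0xE9 followed by a space is not a valid UTF-8 sequence.
$invalid = "caf\xE9 <b>";

// Without ENT_SUBSTITUTE the whole result is an empty string.
$strict = htmlspecialchars($invalid, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid part becomes U+FFFD and the
// rest of the string is escaped as usual.
$substituted = htmlspecialchars($invalid, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');                                      // bool(true)
var_dump(strpos($substituted, "\xEF\xBF\xBD") !== false);      // bool(true)
var_dump(strpos($substituted, '&lt;b&gt;') !== false);         // bool(true)
```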
 [2012-09-07 06:38 UTC] andreas dot rieber at t-online dot de
OK, understood. So I will go for a wrapper function where I can set the charset globally and report an error in any case (to identify user problems, potential XSS trouble or simply wrong database entries).
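
Such a wrapper might look like the sketch below. The function name `escape_html()`, its default charset, and the choice of exception are all hypothetical and not from this report; it assumes PHP 7 or later for the type declarations.

```php
<?php
/**
 * Hypothetical wrapper: escape $text for HTML output and fail loudly
 * instead of silently returning an empty string.
 */
function escape_html(string $text, string $charset = 'UTF-8'): string
{
    $escaped = htmlspecialchars($text, ENT_QUOTES, $charset);

    // An empty result for non-empty input means the input was not
    // valid in the given charset.
    if ($escaped === '' && $text !== '') {
        throw new InvalidArgumentException("Input is not valid $charset");
    }

    return $escaped;
}

// Usage: valid input is escaped, invalid input raises an exception.
$ok = escape_html('<b>ok</b>');

try {
    escape_html("caf\xE9");
    $failed = false;
} catch (InvalidArgumentException $e) {
    $failed = true;
}
```

Throwing is only one way to report the problem; logging and falling back to ENT_SUBSTITUTE would be another.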
 [2016-06-29 14:13 UTC] cmb@php.net
-Status: Open
+Status: Feedback
-Package: *Unicode Issues
+Package: Strings related
-Assigned To:
+Assigned To: cmb
 [2016-06-29 14:13 UTC] cmb@php.net
> The default charset for htmlspecialchars should be "ASCII
> compatible"

This is not an option, as has been explained by Rasmus.

> I now disagree with the decision of the empty string, with php
> flexible typing this should have been false or null.

If you still feel that the return value should be changed, please
adjust the title of this feature request.
 [2016-06-30 16:31 UTC] bfanger at gmail dot com
-Summary: htmlspecialchars() should work on ascii compatible encodings by default.
+Summary: htmlspecialchars() should return NULL on (encoding) failure
-Status: Feedback
+Status: Assigned
 [2016-06-30 16:31 UTC] bfanger at gmail dot com
Updated the summary as requested.
From: "htmlspecialchars() should work on ascii compatible encodings by default."
To: "htmlspecialchars() should return NULL on (encoding) failure"

I still think it would be useful if we could somehow get access to the encoding error.
 [2016-07-01 13:34 UTC] cmb@php.net
-Status: Assigned
+Status: Open
-Assigned To: cmb
+Assigned To:
 [2016-07-01 13:34 UTC] cmb@php.net
> I still think it would be useful if we could somehow get access
> to the encoding error.

At least for now you could use ENT_SUBSTITUTE, and check whether
the returned string contains REPLACEMENT CHARACTERs (and their
position, if desired).
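
The workaround above could be sketched as follows. The helper name `find_substitutions()` is hypothetical, and the check cannot tell a substituted character apart from a U+FFFD that was already present in valid input, so treat it as a heuristic.

```php
<?php
/**
 * Hypothetical helper: escape $text with ENT_SUBSTITUTE and return the
 * escaped string together with the byte offsets (in the escaped string)
 * of every U+FFFD REPLACEMENT CHARACTER found in it.
 */
function find_substitutions($text)
{
    $replacement = "\xEF\xBF\xBD"; // U+FFFD encoded as UTF-8
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    $positions = array();
    $offset = 0;
    while (($pos = strpos($escaped, $replacement, $offset)) !== false) {
        $positions[] = $pos;
        $offset = $pos + strlen($replacement);
    }

    return array($escaped, $positions);
}

// Usage: an ISO-8859-1 "café" has one invalid byte, a UTF-8 "café" has none.
list($escapedBad, $badPositions) = find_substitutions("caf\xE9");
list($escapedGood, $goodPositions) = find_substitutions("caf\xC3\xA9");
```

An empty position list means no substitution was detected; a non-empty one can be logged or turned into an error by the caller.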
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Tue Mar 19 05:01:29 2024 UTC