PHP Bugs  
Bug #43896 htmlspecialchars() returns empty string on invalid unicode sequence
Submitted: 20 Jan 2008 2:12am UTC   Modified: 26 Nov 2008 4:33am UTC
From: arnaud dot lb at gmail dot com   Assigned to:
Status: To be documented   Category: Strings related
Version: 5CVS-2008-07-15   OS: *
Votes: 32   Avg. Score: 4.7 ± 0.7   Reproduced: 29 of 30 (96.7%)
Same Version: 17 (58.6%)   Same OS: 17 (58.6%)
[20 Jan 2008 2:12am UTC] arnaud dot lb at gmail dot com
Description:
------------
htmlspecialchars/htmlentities returns an empty string when the input 
contains an invalid unicode sequence.

I think these functions should just skip the invalid sequences or 
encode them byte by byte (e.g. 0xE9 => &#xE9;), instead of 
discarding the whole string.

Sometimes you have to display arbitrary strings of unknown encoding. 
So you make them safer using htmlspecialchars($string, 
ENT_COMPAT, "site_encoding, utf-8 in my case"), but if there is at 
least one invalid sequence in the string, it returns an empty 
string :/

Reproduce code:
---------------
$string = "Voil\xE0"; // "Voilà", in ISO-8859-15

var_dump(htmlspecialchars($string, ENT_COMPAT, "utf-8"));

Expected result:
----------------
string(4) "Voil"

OR 

string(10) "Voil&#xE0;"

Actual result:
--------------
string(0) ""
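
For input of uncertain encoding, one possible workaround is to check the bytes before escaping and convert them when they are not valid UTF-8. The sketch below is only an illustration: escape_untrusted() is a hypothetical helper name, it assumes the mbstring extension is loaded, and the ISO-8859-15 fallback is a guess about where the data came from, not something the functions can detect. It uses scalar type declarations, so it needs a newer PHP than the versions discussed here.

```php
<?php
// Hypothetical helper (not part of PHP): escape a string whose encoding
// is not known for certain. Assumes ext/mbstring is available and that
// anything that is not valid UTF-8 is ISO-8859-15.
function escape_untrusted(string $s, string $fallback = 'ISO-8859-15'): string
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        // Re-encode the raw bytes so htmlspecialchars() sees valid UTF-8.
        $s = mb_convert_encoding($s, 'UTF-8', $fallback);
    }
    return htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
}

// "Voil\xE0" is not valid UTF-8, so it is converted before escaping.
$escaped = escape_untrusted("Voil\xE0");
```

With this approach the string is kept rather than discarded, at the cost of guessing the source encoding.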
[24 Jan 2008 12:29pm UTC] arnaud dot lb at gmail dot com
I made a patch for this bug:

http://s3.amazonaws.com/arnaud.lb/php_htmlentities_utf.patch

The internal get_next_char() function returns a status of FAILURE 
when it encounters an invalid or incomplete sequence, which causes 
the htmlspecialchars and htmlentities functions to return an empty 
string.

This patch modifies the behavior of these functions to skip invalid 
sequences, without discarding the whole string. This involves very 
few changes and makes the behavior of these functions more 
consistent with previous PHP versions.

It also adds a few tests to htmlentities-utf.phpt.
[24 Jan 2008 8:51pm UTC] tallyce at gmail dot com
See also bugs 43294 and 43549 which seem to be the same thing.

This is really starting to bite now. Please can this be fixed, or can
someone suggest how we can reliably process incoming user data in UTF-8
given this behaviour change!
[17 Feb 2008 1:25pm UTC] andreas dot ravnestad at gmail dot com
This seems to be breaking PEAR::Text_Wiki completely when using UTF-8:
http://pear.php.net/bugs/bug.php?id=13136
[5 May 2008 9:00pm UTC] heurika at gmail dot com
Hi,
I've got the same bug, posted as #43740.
Please fix it.

Thanks!
[27 Jun 2008 5:32pm UTC] sillyxone at yaoo dot com
&nbsp; is also affected in 5.2, for example:

$str = 'Hello' . chr(160) . 'there';
print(htmlentities($str, ENT_COMPAT, 'UTF-8'));

Instead of printing "Hello there", it prints nothing (empty string). The
same for htmlspecialchars().

Both functions work fine in 5.1
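
To make the difference concrete, here is a small sketch comparing the lone byte 0xA0 with the same character encoded as UTF-8. The variable names are illustrative only, and the empty-string result applies to versions that have the behaviour described in this report when no flag such as ENT_IGNORE is passed.

```php
<?php
$latin1 = 'Hello' . chr(160) . 'there'; // lone 0xA0 byte: not valid UTF-8
$utf8   = "Hello\xC2\xA0there";         // U+00A0 encoded as UTF-8

// Declared charset does not match the bytes: the whole string is dropped.
$mismatch = htmlentities($latin1, ENT_COMPAT, 'UTF-8');

// Declared charset matches the bytes in both of these calls.
$asLatin1 = htmlentities($latin1, ENT_COMPAT, 'ISO-8859-1');
$asUtf8   = htmlentities($utf8, ENT_COMPAT, 'UTF-8');

var_dump($mismatch); // string(0) ""
var_dump($asLatin1); // string(16) "Hello&nbsp;there"
var_dump($asUtf8);   // string(16) "Hello&nbsp;there"
```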
[18 Jul 2008 12:10am UTC] moriyoshi@php.net
I don't even think this is a valid bug in the first place. You passed a
string that is encoded in ISO-8859-15 to htmlspecialchars() while
specifying UTF-8 to force the string to be treated as "UTF-8". One
should never depend on the past wrong behaviour with which invalid byte
sequences pass through. Besides, you can always work around it by giving
ISO-8859-15 to the third argument.
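
The workaround mentioned above might look like the following sketch. It only helps when the real encoding of the input is known; the sample string and variable names are made up for illustration.

```php
<?php
$string = "Voil\xE0 & co"; // "Voilà & co" as ISO-8859-15 bytes

// Declare the charset the bytes are actually in, instead of UTF-8.
$escaped = htmlspecialchars($string, ENT_COMPAT, 'ISO-8859-15');

// The 0xE0 byte passes through untouched; only "&" is escaped.
// $escaped is still ISO-8859-15, so the page must be served in that
// charset or the result converted before output.
```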

[11 Sep 2008 12:52pm UTC] yunosh@php.net
Not considering this as a bug (or rather a regression) is a major flaw
IMO.
htmlspecialchars() is *THE* tool that developers are encouraged to use
when escaping output of data that comes from an unknown source. By
nature you can't always rely on this data to be perfectly valid. People
copy and paste from Word to HTML forms and do all kinds of weird stuff to
get data into a website.
Simply discarding the complete data just because it's not a completely
valid character stream is going to break all kinds of websites with user
generated content.
[21 Oct 2008 12:08pm UTC] jani@php.net
Actually "The tool" to use for incoming data is the filter extension.
[2 Nov 2008 1:27pm UTC] jani@php.net
Arnaud, fix it yourself.
[26 Nov 2008 4:30am UTC] lbarnaud@php.net
Added ENT_IGNORE as a compatibility flag to skip invalid multibyte
sequences instead of returning an empty string (as iconv's //IGNORE).
These functions will still never return an invalid or incomplete
multibyte sequence.
Example: htmlspecialchars("...", ENT_QUOTES | ENT_IGNORE, "utf-8");
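
A short sketch of the flag applied to the string from the original report, assuming a PHP build that includes ENT_IGNORE. Variable names are illustrative.

```php
<?php
$string = "Voil\xE0"; // 0xE0 starts a UTF-8 sequence that is never completed

// Without the flag, the invalid sequence makes the call return "".
$strict = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

// With ENT_IGNORE, the invalid byte is skipped and the rest is kept.
$lenient = htmlspecialchars($string, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

var_dump($strict);  // string(0) ""
var_dump($lenient); // string(4) "Voil"
```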
[26 Nov 2008 4:33am UTC] lbarnaud@php.net
It seems "Fixed in CVS and need to be documented" does not change the
status if it is set to "Assigned" :/


PHP Copyright © 2001-2009 The PHP Group
All rights reserved.
Last updated: Sat Nov 21 10:30:49 2009 UTC