Bug #43896 htmlspecialchars() returns empty string on invalid unicode sequence
Submitted: 2008-01-20 02:12 UTC Modified: 2011-08-11 09:38 UTC
Votes:35
Avg. Score:4.7 ± 0.6
Reproduced:32 of 33 (97.0%)
Same Version:17 (53.1%)
Same OS:20 (62.5%)
From: arnaud dot lb at gmail dot com Assigned: cataphract (profile)
Status: Closed Package: Strings related
PHP Version: 5CVS-2008-07-15 OS: *
Private report: No CVE-ID: None
 [2008-01-20 02:12 UTC] arnaud dot lb at gmail dot com
Description:
------------
htmlspecialchars/htmlentities return an empty string when the input 
contains an invalid unicode sequence.

I think these functions should just skip the invalid sequences or 
encode them byte by byte (e.g. 0xE9 => &#xE9;), instead of 
discarding the whole string.

Sometimes you have to display arbitrary strings of unknown encoding. 
So you make them safer using htmlspecialchars($string, 
ENT_COMPAT, "site_encoding, utf-8 in my case"), but if there is at 
least one invalid sequence in the string, it returns an empty 
string :/

Reproduce code:
---------------
$string = "Voil\xE0"; // "Voilà", in ISO-8859-15

var_dump(htmlspecialchars($string, ENT_COMPAT, "utf-8"));


Expected result:
----------------
string(4) "Voil"

OR 

string(10) "Voil&#xE0;"

Actual result:
--------------
string(0) ""

 [2008-01-24 12:29 UTC] arnaud dot lb at gmail dot com
I made a patch for this bug:

http://s3.amazonaws.com/arnaud.lb/php_htmlentities_utf.patch

The internal get_next_char() function returns a status of FAILURE 
when it encounters an invalid or incomplete sequence, which causes 
the htmlspecialchars and htmlentities functions to return an empty 
string.

This patch modifies the behavior of these functions to skip invalid 
sequences, without discarding the whole string. This involves very 
few changes and makes the behavior of these functions more 
consistent with previous PHP versions.

It also adds a few tests to htmlentities-utf.phpt.
 [2008-01-24 20:51 UTC] tallyce at gmail dot com
See also bugs 43294 and 43549 which seem to be the same thing.

This is really starting to bite now. Please can this be fixed, or suggest how we can reliably process incoming user data in UTF8 given this behaviour change!
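One way to tell in advance whether a string will trigger this behaviour is to check that it is well-formed UTF-8 before escaping it. The sketch below is not from the thread: `is_valid_utf8()` is a hypothetical helper name, and it relies on PCRE's `u` modifier, which makes `preg_match()` return false when the subject is malformed UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP): true when $s is well-formed UTF-8.
function is_valid_utf8($s)
{
    // With the /u modifier PCRE validates the subject string first;
    // preg_match() returns false (not 0) when it is malformed UTF-8.
    return preg_match('//u', $s) === 1;
}

$valid   = is_valid_utf8("Voil\xC3\xA0"); // "Voilà" encoded as UTF-8
$invalid = is_valid_utf8("Voil\xE0");     // "Voilà" encoded as ISO-8859-15

var_dump($valid);   // bool(true)
var_dump($invalid); // bool(false)
```

What to do with a string that fails the check (reject it, convert it, or strip the bad bytes) remains an application decision.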
 [2008-02-17 13:25 UTC] andreas dot ravnestad at gmail dot com
This seems to be breaking PEAR::Text_Wiki completely when using UTF-8: http://pear.php.net/bugs/bug.php?id=13136
 [2008-05-05 21:00 UTC] heurika at gmail dot com
Hi,
I've got the same Bug, posted on #43740.
Please fix it.

Thanks!
 [2008-06-27 17:32 UTC] sillyxone at yaoo dot com
&nbsp; is also affected in 5.2, for example:

$str = 'Hello' . chr(160) . 'there';
print(htmlentities($str, ENT_COMPAT, 'UTF-8'));

Instead of printing "Hello there", it prints nothing (empty string). The same for htmlspecialchars().

Both functions work fine in 5.1
 [2008-07-18 00:10 UTC] moriyoshi@php.net
I don't even think this is a valid bug in the first place. You passed a 
string that is encoded in ISO-8859-15 to htmlspecialchars() while 
specifying UTF-8 to force the string to be treated as "UTF-8". One 
should never depend on the past wrong behaviour with which invalid byte 
sequences pass through. Besides, you can always work around it by giving 
ISO-8859-15 to the third argument.
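A minimal sketch of the workaround described above, reusing the reporter's test string and assuming the input really is ISO-8859-15. When the third argument names the actual encoding, the byte 0xE0 is a valid character and passes through unchanged. The variable names are illustrative only.

```php
<?php
// The reporter's test string: "Voilà" encoded as ISO-8859-15 (0xE0 = à).
$string = "Voil\xE0";

// Declaring the real encoding means there is no invalid sequence to reject.
$escaped = htmlspecialchars($string, ENT_COMPAT, 'ISO-8859-15');

// The single-byte character passes through; only & < > " would be escaped here.
var_dump($escaped === $string); // bool(true)
```

This only helps when the encoding of the input is known. The result is still ISO-8859-15 bytes, so it would need converting before being sent in a UTF-8 page.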
 [2008-09-11 12:52 UTC] yunosh@php.net
Not considering this as a bug (or rather a regression) is a major flaw IMO.
htmlspecialchars() is *THE* tool that developers are encouraged to use when escaping output of data that comes from an unknown source. By nature you can't always rely on this data to be perfectly valid. People copy and paste from Word to HTML forms and do all kinds of weird stuff to get data into a website.
Simply discarding the complete data just because it's not a completely valid character stream is going to break all kinds of websites with user-generated content.
 [2008-10-21 12:08 UTC] jani@php.net
Actually "The tool" to use for incoming data is the filter extension..
 [2008-11-02 13:27 UTC] jani@php.net
Arnaud, fix it yourself.
 [2008-11-26 04:30 UTC] lbarnaud@php.net
Added ENT_IGNORE as a compatibility flag to skip invalid multibyte sequences instead of returning an empty string (as iconv's //IGNORE). These functions will still never return an invalid or incomplete multibyte sequence.
Example: htmlspecialchars("...", ENT_QUOTES | ENT_IGNORE, "utf-8");
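A short sketch of the flag in use, applied to the test string from the original report. The ENT_SUBSTITUTE line is an addition that is not part of this fix: that flag exists in PHP 5.4 and later and replaces invalid sequences with U+FFFD instead of dropping them. Variable names are illustrative.

```php
<?php
// Invalid as UTF-8: 0xE0 starts a multibyte sequence that never completes.
$string = "Voil\xE0";

// Without ENT_IGNORE the whole string is rejected.
$strict = htmlspecialchars($string, ENT_COMPAT, 'UTF-8');

// With ENT_IGNORE the invalid byte is dropped and the rest is kept.
$ignored = htmlspecialchars($string, ENT_COMPAT | ENT_IGNORE, 'UTF-8');

// ENT_SUBSTITUTE (PHP 5.4+, not part of this fix) inserts U+FFFD instead.
$substituted = htmlspecialchars($string, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict);  // string(0) ""
var_dump($ignored); // string(4) "Voil"
var_dump($substituted === "Voil\xEF\xBF\xBD"); // bool(true)
```

Silently dropping bytes can change the meaning of the text, so substituting a visible replacement character may be the safer choice where PHP 5.4 or later is available.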
 [2008-11-26 04:33 UTC] lbarnaud@php.net
It seems "Fixed in CVS and need to be documented" does not change the status if it is set to "Assigned" :/
 [2010-10-11 03:15 UTC] cataphract@php.net
Automatic comment from SVN on behalf of cataphract
Revision: http://svn.php.net/viewvc/?view=revision&revision=304297
Log: - Documented addition of ENT_IGNORE as per bug #43896
  (changed its status from TBD to Closed).
 [2010-10-11 03:16 UTC] cataphract@php.net
-Status: To be documented
+Status: Closed
-Assigned To:
+Assigned To: cataphract
 [2010-10-11 03:16 UTC] cataphract@php.net
Noted addition of ENT_IGNORE in the manual entries for htmlspecialchars and htmlentities.
 [2011-02-06 20:58 UTC] shaun dot bruno at gmail dot com
I'm still having this problem - running php 5.2.15
 [2011-02-06 21:02 UTC] shaun dot bruno at gmail dot com
Ah... I realized I need 5.3
 [2011-08-11 04:59 UTC] hardin at boulder dot nist dot gov
echo "test = " . htmlentities("some text", ENT_QUOTES | ENT_IGNORE, 'UTF-8', false);
returns: test = 

echo "test = " . htmlentities("some text", ENT_QUOTES | ENT_IGNORE, 'UTF-8');
returns: test = some text

The latter is the expected result, but why does adding the fourth parameter, to prevent double-encoding, cause this function (and also htmlspecialchars) to return the empty string?  How can this be prevented?

I have a form that I want to redisplay to users until all their input has been corrected, preserving their responses in the fields so they can start from what worked.  The users are international, with names containing lots of accent marks and utf-8 characters, and some of the input is mathematical, with Greek characters and such, so I want to assume the input is utf-8 to preserve all of this, without messing it up on multiple passes.  Thanks for your help.
 [2011-08-11 09:38 UTC] cataphract@php.net
I can't reproduce that:

<?php
echo htmlentities("some\x80 text&gt;", ENT_QUOTES | ENT_IGNORE, 'UTF-8', false), "\n";
echo htmlentities("some\x80 text&lt;", ENT_QUOTES | ENT_IGNORE, 'UTF-8');

gives the expected

some text&gt;
some text&amp;lt;
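For reference, a small sketch of what the fourth parameter (double_encode) changes on its own, without any invalid bytes involved. The input string and variable names are made up for illustration.

```php
<?php
// Input that already contains one entity (&lt;) and one bare ampersand.
$input = "1 &lt; 2 & 3";

// double_encode = true (the default): the existing entity is escaped again.
$twice = htmlentities($input, ENT_QUOTES, 'UTF-8');

// double_encode = false: existing entities are left alone,
// while the bare ampersand is still escaped.
$once = htmlentities($input, ENT_QUOTES, 'UTF-8', false);

var_dump($twice); // string(20) "1 &amp;lt; 2 &amp; 3"
var_dump($once);  // string(16) "1 &lt; 2 &amp; 3"
```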
 [2011-08-12 22:18 UTC] hardin at boulder dot nist dot gov
cataphract, thanks for looking at this.
When I try your tests, neither line generates any output.  I am using PHP Version 5.3.5-1ubuntu7.2.  Any ideas where I should look for a problem?  Could this be a server configuration issue?  Thanks.
 [2011-08-12 23:01 UTC] hardin at boulder dot nist dot gov
My humble apologies...
I am working on two servers and I got confused as to which was which.  The problem you could not reproduce was because I was actually running PHP Version 4.3.9.  On PHP Version 5.3.5 it works as expected.  I regret wasting your time with my error.