  [2008-01-20 02:12 UTC] arnaud dot lb at gmail dot com
Description:
------------
htmlspecialchars/htmlentities return an empty string when the input contains an invalid Unicode sequence.

I think these functions should just skip the invalid sequences or encode them byte by byte (e.g. 0xE9 => é), instead of discarding the whole string.

Sometimes you have to display arbitrary strings of unknown encoding, so you make them safer using htmlspecialchars($string, ENT_COMPAT, "site_encoding, utf-8 in my case"). But if there is at least one invalid sequence in the string, it returns an empty string :/

Reproduce code:
---------------
$string = "Voil\xE0"; // "Voilà" in ISO-8859-15
var_dump(htmlspecialchars($string, ENT_COMPAT, "utf-8"));

Expected result:
----------------
string(4) "Voil"
OR
string(10) "Voilà"

Actual result:
--------------
string(0) ""
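One possible workaround for the situation described above, sketched under assumptions: check whether the input is valid UTF-8 and, if it is not, convert it from a legacy encoding before escaping. The helper name escape_for_html and the ISO-8859-15 fallback are hypothetical choices for illustration, not something PHP provides, and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of PHP): escape a string for HTML output,
// assuming that input which is not valid UTF-8 is really in $fallbackEncoding.
function escape_for_html(string $input, string $fallbackEncoding = 'ISO-8859-15'): string
{
    // mb_check_encoding() returns false when the bytes are not valid UTF-8.
    if (!mb_check_encoding($input, 'UTF-8')) {
        // Assumption: the bytes are in the fallback encoding; convert to UTF-8.
        $input = mb_convert_encoding($input, 'UTF-8', $fallbackEncoding);
    }

    return htmlspecialchars($input, ENT_COMPAT, 'UTF-8');
}

// The input from the report: 0xE0 is "à" in ISO-8859-15.
var_dump(escape_for_html("Voil\xE0"));
```

This only helps when the guess about the fallback encoding is right. For input whose encoding is truly unknown, the converted text may be wrong, but the result is at least valid UTF-8 and not an empty string.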
Added ENT_IGNORE as a compatibility flag to skip invalid multibyte sequences instead of returning an empty string (as iconv's //IGNORE). These functions will still never return an invalid or incomplete multibyte sequence.

Example:
htmlspecialchars("...", ENT_QUOTES | ENT_COMPAT, "utf-8");

echo "test = " . htmlentities("some text", ENT_QUOTES | ENT_IGNORE, 'UTF-8', false);
returns:
test =

echo "test = " . htmlentities("some text", ENT_QUOTES | ENT_IGNORE, 'UTF-8');
returns:
test = some text

The latter is the expected result, but why does adding the fourth parameter, to prevent double-encoding, cause this function (and also htmlspecialchars) to return the empty string? How can this be prevented?

I have a form that I want to redisplay to users until all their input has been corrected, preserving their responses in the fields so they can start from what worked. The users are international, with names containing lots of accent marks and utf-8 characters, and some of the input is mathematical, with Greek characters and such, so I want to assume the input is utf-8 to preserve all of this, without messing it up on multiple passes. Thanks for your help.

I can't reproduce that:

<?php
echo htmlentities("some\x80 text>", ENT_QUOTES | ENT_IGNORE, 'UTF-8', false), "\n";
echo htmlentities("some\x80 text<", ENT_QUOTES | ENT_IGNORE, 'UTF-8');

gives the expected

some text&gt;
some text&lt;
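To tie the thread together, here is a small sketch of the behaviour discussed above, using the input from the original report. It assumes a PHP version that has ENT_IGNORE. The variable names and the "a &amp; b < c" sample string are made up for illustration. Note that the PHP manual discourages ENT_IGNORE because of possible security implications, so treat this as a demonstration and not a recommendation.

```php
<?php
$string = "Voil\xE0"; // a lone 0xE0 byte is not a valid UTF-8 sequence

// Without ENT_IGNORE, one invalid sequence makes the whole result empty.
$strict = htmlspecialchars($string, ENT_COMPAT, 'UTF-8');

// With ENT_IGNORE, the invalid byte is dropped and the rest is kept.
$lenient = htmlspecialchars($string, ENT_COMPAT | ENT_IGNORE, 'UTF-8');

// Fourth argument false (no double encoding): "&amp;" is left as it is,
// while the bare "<" is still escaped.
$once = htmlentities("a &amp; b < c", ENT_QUOTES | ENT_IGNORE, 'UTF-8', false);

var_dump($strict, $lenient, $once);
```

For the form-redisplay case in the question, the fourth argument only controls whether existing entities are encoded again. It is independent of how invalid byte sequences are handled.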