Bug #43294 htmlentities with UTF8 fails if dagger character supplied
Submitted: 2007-11-14 14:39 UTC Modified: 2008-01-30 08:37 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:1 (100.0%)
Same OS:1 (100.0%)
From: tallyce at gmail dot com Assigned:
Status: Not a bug Package: Strings related
PHP Version: 5.2.5 OS: Windows or Linux
Private report: No CVE-ID: None
 [2007-11-14 14:39 UTC] tallyce at gmail dot com
Description:
------------
A string that includes the † (dagger) symbol and is processed with htmlentities() using UTF-8 as the encoding is discarded entirely, so the output appears blank.

This is definitely a change in PHP 5.2.5. Tested on both Windows and Linux machines.

Reproduce code:
---------------
<?php echo htmlentities ('Test †', ENT_COMPAT, 'UTF-8') . '<br />' . htmlentities ('Test', ENT_COMPAT, 'UTF-8'); ?>

Expected result:
----------------
Test †
Test


[This is indeed the result as expected, on PHP v.5.2.4]

Actual result:
--------------
Test



[Blank line at start]

 [2007-12-10 10:01 UTC] jani@php.net
Seems to work fine for me:

[jani@localhost ~]$ php t.php
Test &dagger;<br />Test[

Please try on command line.
 [2007-12-10 10:02 UTC] jani@php.net
Correct output:

$ php t.php
Test &dagger;<br />Test

 [2007-12-18 01:00 UTC] php-bugs at lists dot php dot net
No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
 [2008-01-22 14:55 UTC] tallyce at gmail dot com
I've been spending further time trying to work out what's happening, and am convinced something is definitely not right.

I've also found another character where the presence of the character results in the whole string disappearing, and there may be others.

Using this reproduce code:

<?php echo htmlentities ('Test › †', ENT_COMPAT, 'UTF-8') . '<br />' . preg_replace('/[^\x00-\x7F]/e', '"&#".ord("$0").";"', 'Test › †') . '<br />' . htmlentities ('Test', ENT_COMPAT, 'UTF-8') . '<br />'; ?>

I get different results for machines running SUSE Linux/PHP5.2.4, Linux Ubuntu/PHP 5.2.3 and WinXP/PHP 5.2.5. Only the second gives the result I would expect.





1. From a linux machine terminal:

Firstly doing
less t.php
gives
<?php echo htmlentities ('Test 233 206', ENT_COMPAT, 'UTF-8') . '<br />' . preg_replace('/[^\x00-\x7F]/e', '"&#".ord("$0").";"', 'Test 233 206') . '<br />' . htmlentities ('Test', ENT_COMPAT, 'UTF-8') . '<br />'; ?>
with the 233 and 206 background-highlighted.


php -v
PHP 5.2.4 (cli) (built: Sep 12 2007 15:23:24)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

Test <br />Test &#155; &#134;<br />Test<br />




2. From the same machine but viewing with a web browser (FF2.0.0.11/WinXP), i.e. example.com/t.php (which is serving up UTF-8 pages as confirmed by web-sniffer.net):

Test ? ?<br />Test &#155; &#134;<br />Test<br />

[two symbols appear as ? in diamond]



3. On another machine, with the putty terminal set to UTF-8:

less t.php
gives:
<?php echo htmlentities ('Test › †', ENT_COMPAT, 'UTF-8') . '<br />' . preg_replace('/[^\x00-\x7F]/e', '"&#".ord("$0").";"', 'Test › †') . '<br />' . htmlentities ('Test', ENT_COMPAT, 'UTF-8') . '<br />'; ?>
exactly as first entered.

php -v
PHP 5.2.3-1ubuntu6.2 (cli) (built: Dec  3 2007 19:59:42)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

php t.php
Test &rsaquo; &dagger;<br />Test &#226;&#128;&#186; &#226;&#128;&#160;<br />Test<br />



4. Same machine as (3) but via web browser:

Test &rsaquo; &dagger;<br />Test &#226;&#128;&#186; &#226;&#128;&#160;<br />Test<br />



5. On a Windows machine

C:\Documents and Settings\username>php -v
PHP 5.2.5 (cli) (built: Nov  8 2007 23:18:51)
Copyright (c) 1997-2007 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2007 Zend Technologies

H:\>php t.php
PHP Warning:  htmlentities(): Invalid multibyte sequence in argument in H:\t.php on line 1
<br />Test &#155; &#134;<br />Test<br />



6. Same machine as (5) but via web browser

<br />Test &#155; &#134;<br />Test<br />
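
A possible reading of the less output in case 1, offered as a sketch rather than a confirmed diagnosis: if the highlighted 233 and 206 are octal byte values (an assumption), they are the single bytes 0x9B and 0x86, which are › and † in Windows-1252 and are not valid UTF-8 on their own. The snippet below checks that arithmetic and also shows why the preg_replace line prints three entities per character in case 3. It assumes the mbstring extension is available.

```php
<?php
// Assumption: "233" and "206" in the less output are octal byte values.
$bytes = chr(octdec('233')) . chr(octdec('206'));

// Decimal 155 and 134, matching the &#155; &#134; output in cases 1, 5 and 6.
echo bin2hex($bytes), "\n";                      // 9b86

// 0x9B and 0x86 are continuation bytes with no lead byte, so not valid UTF-8.
var_dump(mb_check_encoding($bytes, 'UTF-8'));    // bool(false)

// The UTF-8 encodings of › (U+203A) and † (U+2020) are three bytes each.
$utf8 = "\xE2\x80\xBA \xE2\x80\xA0";
var_dump(mb_check_encoding($utf8, 'UTF-8'));     // bool(true)

// Without the /u modifier the regex matches byte by byte, so ord() sees
// 226, 128, 186 for ›, as in the &#226;&#128;&#186; output of case 3.
$perByte = array_map('ord', str_split("\xE2\x80\xBA"));
echo implode(' ', $perByte), "\n";               // 226 128 186
```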
 [2008-01-29 14:57 UTC] rasmus@php.net
Just check to see if the dagger is properly represented as a UTF-8 character. It should be the byte sequence e2 80 a0.
That same symbol can be represented in other encodings, obviously, but if you are telling htmlentities that you are using UTF-8 and you then pass it a dagger not encoded in UTF-8, it has no idea what to do with it.

To test it correctly, do this:

echo htmlentities(chr(0xe2).chr(0x80).chr(0xa0), ENT_COMPAT, 'utf-8');

If that spits out &dagger; then everything is fine, and the cases where it isn't working for you are cases where you aren't actually passing it the correct UTF-8 sequence for that character. I don't do Windows, but the above test works fine on Linux, FreeBSD and OSX for me.
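
As a minimal sketch of the difference described here, the snippet below passes htmlentities() a properly encoded UTF-8 dagger and a dagger stored as a single legacy byte. The choice of Windows-1252 (where the dagger is 0x86) as the legacy encoding is an assumption for illustration, and the conversion step assumes the mbstring extension is loaded.

```php
<?php
// Properly encoded UTF-8 dagger (U+2020): bytes e2 80 a0.
$utf8Dagger = "\xE2\x80\xA0";
// Assumption: the failing input holds the Windows-1252 dagger, byte 0x86.
$cp1252Dagger = "\x86";

$ok = htmlentities('Test ' . $utf8Dagger, ENT_COMPAT, 'UTF-8');

// @ silences the "Invalid multibyte sequence" warning that older PHP
// versions raise for this call; the return value is an empty string.
$bad = @htmlentities('Test ' . $cp1252Dagger, ENT_COMPAT, 'UTF-8');

// Converting the input to UTF-8 first makes the call behave as expected.
$converted = mb_convert_encoding('Test ' . $cp1252Dagger, 'UTF-8', 'Windows-1252');
$fixed = htmlentities($converted, ENT_COMPAT, 'UTF-8');

var_dump($ok);    // string(13) "Test &dagger;"
var_dump($bad);   // string(0) ""
var_dump($fixed); // string(13) "Test &dagger;"
```

Saving the script itself as UTF-8 would have the same effect for string literals typed directly into the file.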
 [2008-01-30 08:37 UTC] rasmus@php.net
Marking this as bogus for now.  If you can show that a properly UTF-8 encoded dagger, or some other properly encoded UTF-8 character isn't working, re-open it with that information.  Make sure you show the actual raw byte sequence that is being passed into the function.
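
One way to show the raw byte sequence requested above is to hex-dump the string just before it reaches htmlentities(). This is only a sketch; hexBytes() is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: render a string as space-separated hex bytes.
function hexBytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// A UTF-8 encoded "Test †", written with escapes so that the encoding
// of this source file does not affect the result.
$input = "Test \xE2\x80\xA0";

echo hexBytes($input), "\n";   // 54 65 73 74 20 e2 80 a0
```

If the dump for the dagger shows anything other than e2 80 a0, the string is not UTF-8 at that point.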
 