Bug #43549 changes made to htmlentities
Submitted: 2007-12-09 23:59 UTC Modified: 2008-06-10 15:33 UTC
Votes:14
Avg. Score:4.6 ± 0.6
Reproduced:13 of 13 (100.0%)
Same Version:9 (69.2%)
Same OS:3 (23.1%)
From: mariusads at helpedia dot com Assigned: stas (profile)
Status: Wont fix Package: Strings related
PHP Version: 5.2.5 OS: Redhat?, Linux
Private report: No CVE-ID: None

 [2007-12-09 23:59 UTC] mariusads at helpedia dot com
Description:
------------
I run a website that accepts game cheats submissions from users and displays them in categories and so on.
Users submit .txt files which are saved on the drive. A certain page on the website reads the text file or a fragment of it, runs htmlentities on it and displays it on the screen.

Recently, the hosting company upgraded PHP to 5.2.5, and since then htmlentities returns an empty string when trying to escape some of these files.

I understand this is probably because of that fix regarding multi-byte characters in string, making htmlentities ignore input.
That seems a bit dumb; shouldn't it at least return the part of the string that comes before that multibyte character?

Anyway, the file submitted is plain text and I honestly don't know which characters are wrong, such that they would make htmlentities ignore the text.
The file is uploaded here: http://www.tgdb.net/a.txt

In the scripts I have the following code:

function htmlesc($text)
{
    $s = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return htmlentities($s, ENT_QUOTES, 'UTF-8');
}

The text passes html_entity_decode with no problems but htmlentities returns empty string.

If possible, could you please tell me how I could check in the future whether a string contains multibyte characters, so that I don't have this problem?
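One way to run such a check, as a minimal sketch: PCRE's `u` modifier makes `preg_match` fail on a subject that is not valid UTF-8, so an empty pattern can serve as a validity test. The helper name `is_valid_utf8` is made up for this illustration and is not a PHP built-in.

```php
<?php
// Hypothetical helper (not part of PHP): reports whether $text is valid UTF-8.
function is_valid_utf8($text)
{
    // With the /u modifier, preg_match() returns false when the subject
    // is not valid UTF-8; an empty pattern matches any valid string.
    return preg_match('//u', $text) === 1;
}

var_dump(is_valid_utf8("plain ASCII text")); // valid: ASCII is a subset of UTF-8
var_dump(is_valid_utf8("caf\xC3\xA9"));      // valid: "é" encoded as two bytes
var_dump(is_valid_utf8("caf\xE9"));          // invalid: lone ISO-8859-1 byte
```

A string that fails this check is one that htmlentities with 'UTF-8' would reject.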

Right now, the only solution the hosting company gave to me was to add a rule in .htaccess which makes the server process the PHP files with PHP4.

Thank you for your help.
Marius Hudea

PS. The captcha doesn't seem to work right, I'm sure I didn't get the captcha wrong 8 times in a row

Reproduce code:
---------------
I've used the code below uploaded on several web servers to test:

<html><body>
<?php
$text = $_REQUEST['text'];
echo htmlentities($text,ENT_QUOTES,'UTF-8');
?>
<form name="A" method="post">
<textarea name="text"></textarea>
<input name="sub" type="submit" value="submit"/>
</form>
</body></html>

Test file: http://www.tgdb.net/a.txt

Expected result:
----------------
Expected to have the text displayed on the screen, to have the function return a non-empty string.
Expected at least a partial string, up to that error, not having to check scripts for 5 minutes to see what went wrong.

Actual result:
--------------
Copy and paste text from a.txt results in an empty string.
Any other text is processed correctly.
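The behaviour can be reproduced without the test file. A minimal sketch, assuming the offending bytes are single-byte characters from a legacy encoding (the report does not show the actual bytes in a.txt):

```php
<?php
// "\xE9" is "é" in ISO-8859-1; on its own it is not a valid UTF-8 sequence.
$latin1 = "caf\xE9";
// The same word encoded as UTF-8: "é" becomes the two bytes C3 A9.
$utf8 = "caf\xC3\xA9";

var_dump(htmlentities($utf8, ENT_QUOTES, 'UTF-8'));   // string(11) "caf&eacute;"
var_dump(htmlentities($latin1, ENT_QUOTES, 'UTF-8')); // string(0) ""
```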

History

 [2007-12-10 09:32 UTC] jani@php.net
Works fine for me. Are you sure you have everything as UTF-8, i.e. the page you're sending the form from has its content-type set to UTF-8?
 [2007-12-10 11:24 UTC] mariusads at helpedia dot com
Here are several pages that show this problem with htmlentities:

hxtp://www.tgdb.net/pc/cheats/19556/18_Wheels_of_Steel_Convoy-page1.html
hxtp://www.tgdb.net/pc/faq/5845/Diablo-page1.html

The content on the second link worked fine up until the PHP version was upgraded.

This page and lots of others work:

hxtp://www.tgdb.net/pc/faq/5841/Diablo-page1.html

So it's not a badly coded script in the sense that it worked as I planned.

You can see the text right before it's sent to htmlentities in an HTML comment on all pages; you just have to view the source (the only difference is that I've replaced '--' with '==', as -- is not allowed in comments).

When I reported the problem to the hosting company, I have uploaded the test script written in the first post on two of their servers and a server from Dreamhost.
PHP 5.2.5 hxtp://www.helpedia.com/test2.php
PHP 5.2.5 hxtp://www.tgdb.net/test2.php
PHP 5.2.2 hxtp://www.definethis.org/test2.php

I opened the file a.txt in Firefox, pressed Ctrl+A to select all the text, copied it to the clipboard and pasted it into the form. The result is an empty string on PHP 5.2.5 and the correct string on PHP 5.2.2. The result is also correct on my work computer with PHP 5.2.4.

I didn't manage to download 5.2.5 on my work computer and test it, so I guess it could be a bad build on the hosting company's servers. Will try in the following hour.

(replace hxtp with http, this page thinks I'm spamming)
 [2007-12-10 11:45 UTC] mariusads at helpedia dot com
Just downloaded on my computer (Windows 2003, PHP 5.2.5 from website) and the same problem occurs.

For example this one works: 

hxtp://devtgdb.definethis.org:90/pc/faq/5842/Diablo-page1.html

but this one doesn't:

hxtp://devtgdb.definethis.org:90/pc/faq/5845/Diablo-page1.html

The source code is identical, only difference is ads are disabled from site config.
Also, if the links don't work, sorry, you may read this while I'm sleeping and my computer is turned off. Otherwise, it's cable 4mbps/512kbps so they should work.

(again, please replace hxtp with http)
 [2007-12-11 12:28 UTC] jani@php.net
You never specified the charset for the page. This works fine:

<html>
<head> 
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<pre>
<?php
$text = isset($_REQUEST['text']) ? $_REQUEST['text'] : '';
  
var_dump($text);
var_dump(htmlentities($text,ENT_QUOTES,'UTF-8'));

?>
</pre>
<form name="A" method="post">
<textarea name="text"></textarea>
<input name="sub" type="submit" value="submit"/>
</form>
</body></html>

 [2007-12-11 13:23 UTC] mariusads at helpedia dot com
The example you gave me does work but my issue has nothing to do with receiving data in textarea or input boxes.

The page here:
hxxp://www.tgdb.net/pc/faq/5845/Diablo-page1.html

tries to open a text document from a certain location on the drive which was previously uploaded by the user in a zip file.

If I change the example you gave me to read the text from a file:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
<pre>
<?php
$text = file_get_contents('a.txt');
  
var_dump($text);
var_dump(htmlentities($text,ENT_QUOTES,'UTF-8'));

?>
</pre>
<form name="A" method="post">
<textarea name="text"></textarea>
<input name="sub" type="submit" value="submit"/>
</form>
</body></html>

which simply reads the text from the text file, htmlentities no longer works as I expected. I guess with your example the browser corrects the text pasted into the form before sending it to the script.

Almost all files I have are like the NFO files in scene releases, they contain drawings made with ASCII characters, but I need them escaped because the whole page is sent as UTF-8. They don't have multibyte stuff on purpose and they're not corrupted or something like that.

It's also not feasible to tell users to paste the files into a text area in a form; almost all submitted content is sent in a ZIP file and my scripts extract the text file. Some files (walkthroughs) are also about 400-600 KB in size, so using a text area is out of the question.

I'm surprised to see that htmlentities returns a blank string and I honestly don't know how to fix this so any help would be great. Shouldn't it return the string up until the point where multibyte characters are, whatever that means?

Could my problem be solved if I don't use the charset argument or if I use htmlspecialchars? (I'll test...) Or is there another solution to my predicament?

Thank you again for replying to my questions. I had bad experiences in the past where other people just ignored my messages after replying with "works for me". I really appreciate it.
 [2007-12-11 13:33 UTC] mariusads at helpedia dot com
Small update: the problem is solved (kind of) if I don't specify 'UTF-8' or if I use htmlspecialchars without 'UTF-8', though the document looks different than before. I guess I'll just have to replace those ASCII-art characters with images, probably.
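An alternative that keeps the 'UTF-8' argument is to convert the file contents to UTF-8 before escaping them. A minimal sketch, assuming the files are ISO-8859-1; that encoding is a guess, and NFO-style art often uses CP437 instead, in which case the source charset name would have to change. The function name `escape_legacy_text` is hypothetical.

```php
<?php
// Hypothetical helper: converts text from an assumed legacy encoding to
// UTF-8, then escapes it for output on a UTF-8 page.
function escape_legacy_text($text, $sourceCharset = 'ISO-8859-1')
{
    // iconv() returns false if the conversion fails.
    $utf8 = iconv($sourceCharset, 'UTF-8', $text);
    if ($utf8 === false) {
        return '';
    }
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
}

echo escape_legacy_text("caf\xE9 <b>"), "\n"; // caf&eacute; &lt;b&gt;
```

Because every byte is a valid ISO-8859-1 character, the conversion cannot fail on malformed input the way UTF-8 validation can; the trade-off is that text in a different encoding is silently mis-converted.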
 [2008-01-03 11:10 UTC] adam at shiftcreate dot com
I have the same problem - worked before 5.2.5
 [2008-01-13 17:30 UTC] yossi_shelli at cso dot co dot il
I have the same problem.

I'm using an Apache server on Windows XP to develop my website.
While developing I had PHP 5.2.2 installed and everything worked. Today, while showing the site to a friend of mine, I found out that the text of the user comments is not showing.

After playing around a bit, I found out that htmlentities, which worked fine before, is the problem.

I also found out that removing 'UTF-8' shows the text, but not in the encoding my website needs, which is Hebrew.

htmlentities did not support Hebrew characters in the first place, so I used UTF-8, which worked fine for me. Now the UTF-8 setting on htmlentities results in NULL strings, and I don't have a solution to my problem.
 [2008-01-14 08:36 UTC] s-beutel at gmx dot de
Hi,

I confirm the very same issue for PHP 5.2.1/Apache2/RedHat. 

- has nothing to do with the browser encoding or GET'ed/POST'ed variables, since I simply convert a static string
- seems to be installation specific, since it runs perfectly on my windows box (PHP 5.2.0)
- I have the idea - but no evidence yet - that it's an older issue: for almost one year I tried to fix an issue with a tiny webshop which is an outcome of this, and which some users have been complaining about every now and then (obviously, without debugging or narrower information)

Example script: http://sbeutel.sb.ohost.de/trans.php
Plain Text code: http://sbeutel.sb.ohost.de/trans.txt

It simply encodes the string aou_??? with various settings, and htmlentities($str,ENT_QUOTES,'utf-8'); spits out just nothing as soon as non-ASCII characters (german umlauts, in this case) are contained in the string.

Hope this helps. Contact me if I may provide more information.
 [2008-01-24 20:54 UTC] tallyce at gmail dot com
See also bugs 43294 and 43896 which seem to be the same thing.

This is really starting to bite now. Please can this be fixed, or
suggest how we can reliably process incoming user data in UTF8 given
this behaviour change!

I concur this seems to be installation specific and earlier than 5.2.5 as shown in bug 43294.
 [2008-01-28 23:32 UTC] rasmus@php.net
It comes down to what to do with invalid input.  We can't let invalid UTF-8 through, because if you do, your site will be insecure.  Before this fix, your site was actually open to XSS exploits since you were spitting invalid UTF-8 chars out on a page marked as UTF-8 and that confuses IE.

I suppose we could change htmlentities to just strip the invalid chars, but from a security perspective that is typically not the right approach.  You can strip the invalid utf-8 chars yourself with: 

  $str = iconv('utf-8','utf-8',$str);
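Later PHP releases added flags for this case: `ENT_IGNORE` (PHP 5.3) drops invalid sequences, and `ENT_SUBSTITUTE` (PHP 5.4) replaces them with U+FFFD instead of returning an empty string. Neither flag exists in 5.2.5, so the sketch below applies only to newer installations.

```php
<?php
$invalid = "caf\xE9 au lait"; // contains a byte that is not valid UTF-8

// Without ENT_SUBSTITUTE the whole result is an empty string.
$strict = htmlentities($invalid, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid byte is replaced and the rest survives.
$lenient = htmlentities($invalid, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($strict === '');
var_dump($lenient !== '');
```

Substituting keeps the position of the bad byte visible in the output, which is generally considered safer than silently dropping it.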

 [2008-01-29 14:31 UTC] tallyce at gmail dot com
Thanks, but see
http://bugs.php.net/43294

which shows that the dagger character (and others) results in the whole string disappearing, on some installations at least.

I thought the dagger character was a valid UTF8 string, or would a submission of that character be considered "invalid input"?
 [2008-01-29 21:13 UTC] rasmus@php.net
As I commented in that bug, assuming you are passing in that character properly encoded, it will work.  Nothing in that bug report shows an actual problem as you don't show the exact byte sequence you are passing in.
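One way to show the exact byte sequence is to hex-dump the string just before it is passed to htmlentities. A minimal sketch using the dagger character (U+2020) as the example; the Windows-1252 comparison is an assumption about where a stray single-byte dagger might come from, not something established in this report.

```php
<?php
$utf8Dagger   = "\xE2\x80\xA0"; // U+2020 encoded as UTF-8 (three bytes)
$cp1252Dagger = "\x86";         // the same character in Windows-1252 (one byte)

echo bin2hex($utf8Dagger), "\n";   // e280a0
echo bin2hex($cp1252Dagger), "\n"; // 86

// Only the UTF-8 form is accepted when the charset argument is 'UTF-8'.
var_dump(htmlentities($utf8Dagger, ENT_QUOTES, 'UTF-8'));   // string(8) "&dagger;"
var_dump(htmlentities($cp1252Dagger, ENT_QUOTES, 'UTF-8')); // string(0) ""
```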
 [2008-06-10 15:33 UTC] stas@php.net
As the function seems to work as intended and there are other ways of sanitizing UTF-8, I'm marking this as wontfix for now, unless any new info arrives.
 