PHP :: Bug #62861 :: htmlentities returns empty string when it shouldn't

Bug #62861	htmlentities returns empty string when it shouldn't
Submitted:	2012-08-19 04:14 UTC	Modified:	2012-08-19 16:37 UTC
From:	soapergem at gmail dot com	Assigned:
Status:	Not a bug	Package:	*General Issues
PHP Version:	5.4.6	OS:	Windows
Private report:	No	CVE-ID:	None


[2012-08-19 04:14 UTC] soapergem at gmail dot com

Description:
------------
Doesn't UTF-8 include basic ASCII characters, too? Right now when I try to encode the copyright symbol (©) using htmlentities (it should encode to &copy;), it doesn't work. I discovered this since the default encoding for htmlentities() was switched from ISO-8859-1 to UTF-8 in version 5.4.

I have plenty of places where I rely on basic symbols, such as the copyright symbol, being encoded properly with htmlentities(). Having to go in and change all the instances of htmlentities($string) to htmlentities($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1') is not practical (there are MANY). And with the whole output of the function being blank, it just makes my scripts completely unusable now.

Help!

Test script:
---------------
<?php

echo htmlentities('©', ENT_COMPAT | ENT_HTML401, 'UTF-8');

?>

Expected result:
----------------
&copy;

Actual result:
--------------
(Nothing - an empty string)


[2012-08-19 04:38 UTC] rasmus@php.net

-Status: Open +Status: Not a bug

[2012-08-19 04:38 UTC] rasmus@php.net

UTF-8 is only compatible with low-ascii, not with high. The copyright symbol in 
ISO-8859-1 is character code (in hex) <A9>. In UTF-8 the copyright symbol is 
represented by two bytes, <C2><A9>. The world has gone UTF-8. If your editor is 
in UTF-8 mode and you enter/paste a copyright symbol and pass it to 
htmlentities() you will get "&copy;" back. So rather than change the code to 
hardcode ISO-8859-1 you should convert your datasources to UTF-8. Most of them 
are probably already UTF-8 which means that your current code was likely not 
handling these correctly since it assumed ISO-8859-1 before.
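
The byte-level difference described above can be checked directly. This is a minimal sketch, not part of the original report: it writes the copyright sign as explicit byte escapes so the result does not depend on how the script file itself was saved. The variable names are illustrative only.

```php
<?php
// The copyright sign as raw bytes in each encoding.
$latin1 = "\xA9";      // ISO-8859-1 / CP-1252: one byte
$utf8   = "\xC2\xA9";  // UTF-8: two bytes

$flags = ENT_COMPAT | ENT_HTML401;

// A lone 0xA9 byte is not a valid UTF-8 sequence, so the result is empty.
$invalid = htmlentities($latin1, $flags, 'UTF-8');

// The same byte is accepted when the function is told it is ISO-8859-1.
$asLatin1 = htmlentities($latin1, $flags, 'ISO-8859-1');

// Correctly encoded UTF-8 input is converted as expected.
$asUtf8 = htmlentities($utf8, $flags, 'UTF-8');

var_dump($invalid, $asLatin1, $asUtf8);
```

The first call returns an empty string because the input is not valid UTF-8 and neither ENT_IGNORE nor ENT_SUBSTITUTE is set, which matches the behaviour in the report.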

For some perspective: 
http://w3techs.com/technologies/overview/character_encoding/all
which shows that 72% of the top-million sites on the Web are using UTF-8. And 
this number is growing.

[2012-08-19 05:02 UTC] soapergem at gmail dot com

With respect, the 72% figure you cited is misleading at best. The character 
encoding listed in the HTML gives no indication of what encoding the files were 
actually saved in. All it is is a <meta> tag in the <head> that says UTF-8. I 
would suspect the vast majority of those files are still saved in ISO-8859-1, 
though.

My prediction is that you're going to get A LOT of complaints over the switch -- 
especially from Windows users, who almost always save things in ISO-8859-1, 
since that is the default encoding in Windows. With PHP on Windows ever growing, 
fighting the Windows users is just shooting yourself in the foot.

[2012-08-19 05:22 UTC] rasmus@php.net

I think you are confusing CP-1252 with ISO-8859-1. And the default on Windows 
internally is actually UTF-16 but there is a library call named IsTextUnicode() 
which most apps use to determine which encoding something is in and it tends 
towards CP-1252 if it can't figure it out, so I assume that is what you mean 
when you say everyone saves things in ISO-8859-1 on Windows. Every editor I know 
of has a very simple encoding setting to force the editor to a specific 
encoding. Set it to UTF-8 and all your problems will go away. Note also that 
CP-1252 is not used in most of the world, so this assertion that most pages are 
saved in ISO-8859-1 is obviously not true. Regardless, this is not something 
that will be reverted. CP-1252 is disappearing and I think you will find much 
less of it in Windows 8 as it really doesn't play well with HTML5.

[2012-08-19 13:30 UTC] soapergem at gmail dot com

Yes, your assumptions about what I was meaning to say were correct. I really 
meant "ANSI," which you know as CP-1252.

But there is definitely still a bug with this. I just followed your instructions 
by saving my test script specifically in the "UTF-8" encoding hoping that, as 
you said, "all my problems will go away."

They didn't.

My test script is exactly the same one that I have listed on this bug report. I 
saved it in Windows Notepad, using the "UTF-8" encoding. I am no longer getting 
an empty string -- which is progress. But now I am getting the following output:

ï»¿&copy;

This is definitely NOT the expected result here. It did finally convert the 
copyright symbol, but it prepended not one, not two, but THREE junk characters 
in front of it. This is even worse than before.

If I'm not mistaken, wasn't the whole reason PHP6 was abandoned because the idea 
of converting everything to Unicode was deemed too ambitious? I've already spent far 
more time dealing with this than is practical, as I'm sure you have much 
better things to do, as well. It just seems to me that you guys had a wonderful 
hammer -- a wonderful tool for the job -- and you went and broke off the hammer 
head for no apparent reason.

If I might make a humble suggestion, why not let htmlentities() default to 
whatever the default_charset option is in php.ini? Right now you can only do 
that by explicitly passing an empty string as the third parameter to 
htmlentities, which is very messy and counterintuitive. Shouldn't the 
default_charset actually be, you know, the _default character set_?

[2012-08-19 13:49 UTC] rasmus@php.net

From my command line:

php > echo htmlentities('©', ENT_COMPAT | ENT_HTML401, 'UTF-8');
&copy;

it works fine. If you are actually providing the correct UTF-8 char it will work 
fine. You can verify that by doing this:

php > $a = chr(0xC2).chr(0xA9);
php > echo htmlentities($a, ENT_COMPAT | ENT_HTML401, 'UTF-8');
&copy;

Here I am explicitly passing C2A9 in and I get &copy; back out.

So I have no idea what your Windows Notepad is doing. Look at the output with a 
hex editor and see what it is converting that copyright character to.
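
If a hex editor is not at hand, the same inspection can be done from PHP. This is a sketch under the assumption that the string to examine is already in a variable; `describe_bytes()` is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: report the raw bytes of a string and whether
// they form valid UTF-8.
function describe_bytes(string $s): string
{
    // bin2hex() shows exactly which bytes PHP received.
    $hex = bin2hex($s);
    // With the /u modifier, preg_match() returns false on malformed UTF-8.
    $valid = preg_match('//u', $s) === 1;
    return $hex . ($valid ? ' (valid UTF-8)' : ' (not valid UTF-8)');
}

echo describe_bytes("\xC2\xA9"), "\n"; // copyright sign encoded as UTF-8
echo describe_bytes("\xA9"), "\n";     // copyright sign encoded as ISO-8859-1
```

Passing the literal from the test script to such a helper would show whether the editor stored it as `c2a9` or as `a9`.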

[2012-08-19 13:59 UTC] nikic@php.net

Save your document as UTF-8 *without* BOM. The ï»¿ is just what the UTF-8 Byte Order Mark (BOM) looks like when it is output (which is probably something you don't want, so save the file without it).
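
The UTF-8 BOM is the three bytes EF BB BF. Read as CP-1252 they display as ï»¿, which matches the reported output. Because those bytes sit before the opening `<?php` tag, PHP sends them to the output as literal text. Below is a small sketch for checking whether a file's contents begin with a BOM; the constant and function names are made up for illustration.

```php
<?php
// The UTF-8 byte order mark as raw bytes.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true when $contents begins with the UTF-8 BOM.
function starts_with_utf8_bom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

$withBom    = UTF8_BOM . "<?php echo 'hi';";
$withoutBom = "<?php echo 'hi';";

var_dump(starts_with_utf8_bom($withBom));    // bool(true)
var_dump(starts_with_utf8_bom($withoutBom)); // bool(false)
```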

[2012-08-19 14:11 UTC] soapergem at gmail dot com

There is no option to save without the BOM in Windows Notepad. Nor is there an 
option to save with/without the BOM in many other Windows editors. It is 
automatically added to the file and there is nothing I can do about that -- 
short of writing a script to programmatically go through all my other scripts 
with fopen(), remove the first three characters, and then re-save.
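
For reference, the kind of script described here is short. The following is only a sketch, assuming the files fit in memory and that the "first three characters" are the three BOM bytes EF BB BF; `remove_bom_from_file()` is a hypothetical name, and the demonstration runs on a temporary file rather than real source files.

```php
<?php
// Hypothetical helper: rewrite $path without a leading UTF-8 BOM.
// Returns true if a BOM was found and removed, false otherwise.
function remove_bom_from_file(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false || strncmp($contents, "\xEF\xBB\xBF", 3) !== 0) {
        return false; // unreadable, or no BOM present
    }
    // Drop the first three bytes and write the rest back.
    return file_put_contents($path, substr($contents, 3)) !== false;
}

// Demonstration on a temporary file.
$tmp = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($tmp, "\xEF\xBB\xBFhello");
$removed = remove_bom_from_file($tmp);
$after = file_get_contents($tmp);
unlink($tmp);

var_dump($removed, $after);
```

Anyone adapting this to a real source tree would want a backup first, since the helper overwrites files in place.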

That is NOT a practical option. PHP should be handling this.

As it stands, PHP 5.4 is completely unusable. Until you guys fix this, I need to 
stick with 5.3, because 5.4 will break all of my scripts -- and all the scripts 
of ANYONE who uses htmlentities() on a Windows server. Please take my suggestion 
about using the default_charset to heart. That would finally resolve this issue.

[2012-08-19 14:27 UTC] nikic@php.net

Windows Notepad does not support this because Notepad is not a suitable editor for development. All development-oriented text editors and IDEs support saving files without BOM.

One commonly used text editor for Windows is Notepad++ (in case you don't want to use a full-blown IDE).

[2012-08-19 14:27 UTC] rasmus@php.net

Every real editor can do that. Windows Notepad is not a real editor. Notepad++ 
(which is free and much much better than Notepad), Notepad2, Textmate, Vim, 
Jedit, Ultraedit, Emacs, SourceEdit can all do this.

[2012-08-19 14:31 UTC] soapergem at gmail dot com

I am aware that Notepad is not a suitable editor for development. It is just the 
de facto "basic" editor in Windows. If something doesn't work in Notepad, you're 
usually in trouble.

I use an editor called EditPlus, which is a very good editor. The older version 
which I have used does not have support for removing the BOM, but I see the 
newer version does, so I will have to upgrade.

But I would really appreciate it if you could address my suggestion about using 
the default_charset defined in php.ini automatically. Right now having to call 
htmlentities($string, ENT_COMPAT | ENT_HTML401, "") seems very counter-intuitive 
to invoke what should be the default.

[2012-08-19 14:57 UTC] nikic@php.net

The default_charset sets the default charset for the Content-Type header. It doesn't really have anything to do with the htmlspecialchars() family of functions.

The '' encoding is some sort of magic charset detection algorithm that may or may not guess correctly. The docs explicitly state that you should not use it.

[2012-08-19 16:37 UTC] soapergem at gmail dot com

That makes sense.

In that case, could I submit a feature request to add a config option to php.ini 
called "default_encoding"? By default (or if omitted) it would be UTF-8, of 
course. This would allow users to change it in one place (or change it via ini_set) 
to set the default for the htmlspecialchars family of functions, rather than 
having to grep all the code to change each function call.
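
In user code, a similar single point of control can be approximated with a thin wrapper. This is a sketch only: `APP_HTML_CHARSET` and `html_encode()` are hypothetical names, and existing call sites would still need a one-time change to call the wrapper.

```php
<?php
// Assumed application-wide setting; the name is hypothetical.
const APP_HTML_CHARSET = 'UTF-8';

// Hypothetical wrapper around htmlentities() with the charset fixed in one place.
function html_encode(string $text): string
{
    return htmlentities($text, ENT_COMPAT | ENT_HTML401, APP_HTML_CHARSET);
}

// The copyright sign is given as explicit UTF-8 bytes.
echo html_encode("\xC2\xA9 2012 <Example>"), "\n";
```

Changing the constant to 'ISO-8859-1' would switch every call that goes through the wrapper.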



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Wed Jul 02 05:01:42 2025 UTC