Bug #17008 htmlentities() doesn't encode em or en dash
Submitted: 2002-05-04 23:10 UTC
Modified: 2002-05-05 19:08 UTC
From: flaimo at gmx dot net
Assigned:
Status: Closed
Package: *General Issues
PHP Version: 4.2.0
OS: WinXP / Apache 1.3.24
Private report: No
CVE-ID: None
 [2002-05-04 23:10 UTC] flaimo at gmx dot net
if i'm not wrong this function is supposed to encode all those special characters, right? well, em or en dashes are not encoded. the whole list of characters that should be encoded can be found here:
http://selfhtml.teamone.de/html/referenz/zeichen.htm#benannte_interpunktion

it's in german, but i guess you can see what i mean.

 [2002-05-05 05:23 UTC] markonen@php.net
The code in ext/standard/html.c seems to only support 
entities for characters in the first 256 positions (the 
8-bit range) of a given charset, including utf-8. Of the 
supported charsets, Windows code page 1252 is the only one 
that has em and en dashes in this 8-bit range. Hence it is 
the only charset that will work the way you expect. In 
other words, you need to pass "cp1252" as the third 
argument to htmlentities() and make sure that your input 
string is in cp1252 as well.
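
A minimal sketch of that workaround, for illustration only. It assumes a PHP build whose htmlentities() accepts "cp1252" and "ISO-8859-1" as the charset argument; the variable names are made up for this example. The em dash is written as the single byte 0x97, which is where cp1252 places it (the en dash is 0x96).

```php
<?php
// Input is a cp1252 byte string: 0x97 is the em dash in that code page.
$input = "a \x97 b";

// Tell htmlentities() that the input is cp1252 so it can map 0x97 to an entity.
$asCp1252 = htmlentities($input, ENT_QUOTES, 'cp1252');

// The same bytes read as ISO-8859-1: 0x97 is not a dash there, so no entity applies.
$asLatin1 = htmlentities($input, ENT_QUOTES, 'ISO-8859-1');

echo $asCp1252, "\n"; // a &mdash; b
```

The charset argument only tells the function how to read the bytes; it does not convert the input, so a utf-8 string passed with "cp1252" would still be misread.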

Support for full utf-8 entities might be coming in a future 
release. Meanwhile, you can convert utf-8 to HTML's numeric 
character references with PHP's mbstring extension and this 
piece of code:

$f = 0xffff; $convmap = array(
/* <!ENTITY % HTMLlat1 PUBLIC 
    "-//W3C//ENTITIES Latin 1//EN//HTML"> %HTMLlat1; */
	 160,  255, 0, $f,
/* <!ENTITY % HTMLsymbol PUBLIC 
    "-//W3C//ENTITIES Symbols//EN//HTML"> %HTMLsymbol; */
	 402,  402, 0, $f,  913,  929, 0, $f,  931,  937, 0, $f,
	 945,  969, 0, $f,  977,  978, 0, $f,  982,  982, 0, $f,
	8226, 8226, 0, $f, 8230, 8230, 0, $f, 8242, 8243, 0, $f,
	8254, 8254, 0, $f, 8260, 8260, 0, $f, 8465, 8465, 0, $f,
	8472, 8472, 0, $f, 8476, 8476, 0, $f, 8482, 8482, 0, $f,
	8501, 8501, 0, $f, 8592, 8596, 0, $f, 8629, 8629, 0, $f,
	8656, 8660, 0, $f, 8704, 8704, 0, $f, 8706, 8707, 0, $f,
	8709, 8709, 0, $f, 8711, 8713, 0, $f, 8715, 8715, 0, $f,
	8719, 8719, 0, $f, 8721, 8722, 0, $f, 8727, 8727, 0, $f,
	8730, 8730, 0, $f, 8733, 8734, 0, $f, 8736, 8736, 0, $f,
	8743, 8747, 0, $f, 8756, 8756, 0, $f, 8764, 8764, 0, $f,
	8773, 8773, 0, $f, 8776, 8776, 0, $f, 8800, 8801, 0, $f,
	8804, 8805, 0, $f, 8834, 8836, 0, $f, 8838, 8839, 0, $f,
	8853, 8853, 0, $f, 8855, 8855, 0, $f, 8869, 8869, 0, $f,
	8901, 8901, 0, $f, 8968, 8971, 0, $f, 9001, 9002, 0, $f,
	9674, 9674, 0, $f, 9824, 9824, 0, $f, 9827, 9827, 0, $f,
	9829, 9830, 0, $f,    
/* <!ENTITY % HTMLspecial PUBLIC 
    "-//W3C//ENTITIES Special//EN//HTML"> %HTMLspecial; */
/* These ones are excluded to enable HTML: 34, 38, 60, 62 */
	 338,  339, 0, $f,  352,  353, 0, $f,  376,  376, 0, $f,
	 710,  710, 0, $f,  732,  732, 0, $f, 8194, 8195, 0, $f,
	8201, 8201, 0, $f, 8204, 8207, 0, $f, 8211, 8212, 0, $f,
	8216, 8218, 0, $f, 8220, 8222, 0, $f,
	8224, 8225, 0, $f, 8240, 8240, 0, $f, 8249, 8250, 0, $f,
	8364, 8364, 0, $f);
/* $html must already hold the utf-8 encoded input string */
echo mb_encode_numericentity($html, $convmap, "UTF-8");
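
For readers unfamiliar with the conversion map: each group of four integers is a start code point, an end code point, an offset and a mask. The sketch below uses a deliberately tiny map that covers only the en dash (8211) and em dash (8212). It assumes the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// One range: code points 8211..8212, offset 0, mask 0xffff.
$dashMap = [8211, 8212, 0, 0xffff];

// utf-8 input with an en dash (E2 80 93) and an em dash (E2 80 94).
$text = "1990\xE2\x80\x932000 \xE2\x80\x94 done";

// Characters inside the range become numeric character references;
// everything else is passed through unchanged.
$encoded = mb_encode_numericentity($text, $dashMap, 'UTF-8');

echo $encoded, "\n"; // 1990&#8211;2000 &#8212; done
```

Because the markup characters 34, 38, 60 and 62 are left out of the map, existing HTML tags in the input survive the conversion.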
 [2002-05-05 13:41 UTC] flaimo at gmx dot net
well, then it's not a bug, but the documentation is a bit confusing:

"This function is identical to htmlspecialchars() in all ways, except that ALL characters which have HTML character entity equivalents are translated into these entities."

and for em-dash for example there's a &mdash; since html 4.0
 [2002-05-05 17:46 UTC] wez@php.net
The docs also say that the iso-8859-1 charset is assumed; there
are no em or en dash characters in that charset.
But you are correct that the docs are not up to date
with regard to which charsets are supported, and to what
extent.

 [2002-05-05 19:08 UTC] wez@php.net
I've committed a fix to CVS HEAD that adds support for the remaining html4 entities.
Your input needs to be utf-8 encoded, and you need to pass utf-8 as the charset, for these characters to be detected/converted.
This change will be in 4.3.0.
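
A short sketch of what the utf-8 path looks like from the caller's side, assuming a PHP version that includes this change. The sample string and variable names are illustrative only.

```php
<?php
// utf-8 input: "é" is C3 A9, the em dash is E2 80 94.
$utf8 = "caf\xC3\xA9 \xE2\x80\x94 ok";

// Passing the charset explicitly makes the call independent of ini defaults.
$entities = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // caf&eacute; &mdash; ok
```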
 [2002-05-05 20:23 UTC] flaimo at gmx dot net
great. thanks for putting it from the buglist to the wishlist :-)
 