| Bug #25670 | cannot yet handle MBCS in html_entity_decode() | ||||
|---|---|---|---|---|---|
| Submitted: | 26 Sep 2003 9:24am UTC | Modified: | 27 Sep 2003 6:11am UTC | ||
| From: | nospam at unclassified dot de | Assigned to: | |||
| Status: | Wont fix | Category: | *General Issues | ||
| Version: | 4.3.2 | OS: | Windows, Linux | ||
| Votes: | 192 | Avg. Score: | 4.3 ± 1.0 | Reproduced: | 170 of 174 (97.7%) |
| Same Version: | 69 (40.6%) | Same OS: | 107 (62.9%) | ||
[26 Sep 2003 9:30am UTC] nospam at unclassified dot de
Slight correction: it won't use Latin-1, it just does nothing. I tested with a Unicode entity (&asymp;, "almost equal") and it came back unchanged, still readable as the entity. This can be tested by passing the result of html_entity_decode() to htmlspecialchars() and then echoing it.
[27 Sep 2003 12:07am UTC] moriyoshi@php.net
The very issue was already addressed and the appropriate fix is ready for php5, though we won't introduce this feature to the current stable version (4.3.x). See: http://cvs.php.net/diff.php/php-src/NEWS?r1=1.1403&r2=1.1404
[27 Sep 2003 6:11am UTC] nospam at unclassified dot de
OK, I found another way for my issue anyway...
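<?php
// Returns the UTF-8 string corresponding to the Unicode code point $num
// (from php.net, courtesy of romans@void.lv).
function code2utf($num)
{
    $num = (int) $num;
    // 1 byte: U+0000..U+007F
    if ($num < 128) return chr($num);
    // 2 bytes: U+0080..U+07FF
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    // 3 bytes: U+0800..U+FFFF
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    // 4 bytes: U+10000 and up (valid Unicode ends at U+10FFFF)
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

// Adapter for preg_replace_callback(): $m[1] holds the decimal digits of the entity.
function code2utf_match($m)
{
    return code2utf($m[1]);
}

// Converts a Latin-1 string to UTF-8, then replaces decimal numeric entities
// (&#NNN;) with their UTF-8 bytes. A named callback replaces the original /e
// modifier, which was removed in PHP 7. utf8_encode() is deprecated as of PHP 8.2.
function encode($str)
{
    return preg_replace_callback('/&#(\\d+);/', 'code2utf_match', utf8_encode($str));
}
?>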
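The bit arithmetic in code2utf() can be traced by hand. The sketch below repeats the two-byte and three-byte branches for two code points chosen here for illustration (252 for "ü" and 8776 for "≈"). The variable names are not from the original post.

```php
<?php
// Two-byte branch: U+00FC ("ü"), decimal 252.
$num = 252;
$two = chr(($num >> 6) + 192)      // 3 + 192 = 195 (0xC3), lead byte
     . chr(($num & 63) + 128);     // 60 + 128 = 188 (0xBC), continuation byte

// Three-byte branch: U+2248 ("≈"), decimal 8776.
$num = 8776;
$three = chr(($num >> 12) + 224)          // 2 + 224 = 226 (0xE2), lead byte
       . chr((($num >> 6) & 63) + 128)    // 9 + 128 = 137 (0x89)
       . chr(($num & 63) + 128);          // 8 + 128 = 136 (0x88)

// bin2hex() shows the raw bytes independent of the terminal encoding.
echo bin2hex($two), "\n";    // c3bc
echo bin2hex($three), "\n";  // e28988
```

Each continuation byte carries six bits of the code point, which is why every step shifts by a multiple of 6 and masks with 63.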
[22 Jun 2004 2:57pm UTC] ross at golder dot org
Is there a particular reason you won't backport this fix to the 4.3 series? I can't find an explanation or discussion of this refusal. If somebody else backported the fix, would it be accepted? It sounds like a fairly important and useful bugfix to me. Just curious. Currently downloading PHP5, will give it a spin.

Description:
------------
Trying to decode HTML entities into UTF-8 results in the following error message:

Warning: cannot yet handle MBCS in html_entity_decode()!

The line is repeated about 200 times, then html_entity_decode just uses the ISO-8859-1 charset.

Reproduce code:
---------------
echo html_entity_decode("&uuml;", ENT_QUOTES, "UTF-8");

Expected result:
----------------
some UTF-8 encoding of 'ü'

Actual result:
--------------
error messages (see above), then Latin-1 encoding of 'ü'
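
For comparison, on PHP 5 and later, where the reply from moriyoshi@php.net says the fix landed, the same call should return UTF-8 bytes rather than a warning. This is a minimal sketch, assuming the input is the named entity for "ü" and adding a second illustrative entity for "≈". The variable names are hypothetical.

```php
<?php
// Assumes PHP 5.0 or later, where html_entity_decode() accepts "UTF-8".
$uuml  = html_entity_decode("&uuml;", ENT_QUOTES, "UTF-8");
$asymp = html_entity_decode("&asymp;", ENT_QUOTES, "UTF-8");

// bin2hex() makes the raw bytes visible regardless of terminal encoding.
echo bin2hex($uuml), "\n";   // c3bc   (UTF-8 for U+00FC)
echo bin2hex($asymp), "\n";  // e28988 (UTF-8 for U+2248)

// The check described in the thread: re-escape the result and echo it.
// An entity that was left undecoded would show up here as "&amp;asymp;".
echo htmlspecialchars($asymp, ENT_QUOTES, "UTF-8"), "\n";
```

Comparing hex dumps avoids confusing a Latin-1 "ü" (one byte, fc) with its UTF-8 form (two bytes, c3bc), which is the distinction this report is about.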