Bug #25670 cannot yet handle MBCS in html_entity_decode()
Submitted: 2003-09-26 09:24 UTC Modified: 2003-09-27 06:11 UTC
Votes:199
Avg. Score:4.3 ± 1.0
Reproduced:176 of 180 (97.8%)
Same Version:71 (40.3%)
Same OS:111 (63.1%)
From: nospam at unclassified dot de Assigned:
Status: Wont fix Package: *General Issues
PHP Version: 4.3.2 OS: Windows, Linux
Private report: No CVE-ID: None
 [2003-09-26 09:24 UTC] nospam at unclassified dot de
Description:
------------
Trying to decode HTML entities into UTF-8 results in the following error message:

Warning: cannot yet handle MBCS in html_entity_decode()!

The warning is repeated about 200 times, and html_entity_decode() then just falls back to the ISO-8859-1 charset.

Reproduce code:
---------------
echo html_entity_decode("&uuml;", ENT_QUOTES, "UTF-8");

Expected result:
----------------
some UTF-8 encoding of 'ü'

Actual result:
--------------
error messages (see above), then Latin-1 encoding of 'ü'

 [2003-09-26 09:30 UTC] nospam at unclassified dot de
A slight correction: it does not fall back to Latin-1, it simply does nothing. I tested with a Unicode entity (&asymp;, "almost equal") and it came back unchanged, still readable as an entity.
This can be tested by passing the result of html_entity_decode() to htmlspecialchars() and then echoing it.
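
A minimal sketch of the check described above. The input entity `&asymp;` is an assumption about what was tested. If the entity is left undecoded, htmlspecialchars() escapes its ampersand, so the entity text itself shows up in the browser; on a PHP version where decoding to UTF-8 works, the character is printed instead.

```php
<?php
// Assumed test input: the "almost equal" entity mentioned in the comment.
$input = '&asymp;';

// Try to decode the entity into UTF-8.
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// Escape the result again: an undecoded entity becomes "&amp;asymp;",
// which a browser renders as the literal text "&asymp;".
echo htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8'), "\n";

// Report which of the two cases occurred.
echo ($decoded === $input) ? "entity left undecoded\n" : "entity decoded\n";
```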
 [2003-09-27 00:07 UTC] moriyoshi@php.net
The very issue was already addressed and the appropriate fix is ready for php5, though we won't introduce this feature to the current stable version (4.3.x).

See:
http://cvs.php.net/diff.php/php-src/NEWS?r1=1.1403&r2=1.1404
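
For comparison, a short sketch of the reported call on a PHP version where the multibyte fix is present (PHP 5 or later, per the comment above). The input `&uuml;` is an assumption about the original reproduce code. Dumping the result as hex makes the encoding visible regardless of terminal charset: UTF-8 encodes ü (U+00FC) as two bytes, whereas Latin-1 would use the single byte FC.

```php
<?php
// Assumed input: the u-umlaut entity from the reproduce code.
$decoded = html_entity_decode('&uuml;', ENT_QUOTES, 'UTF-8');

// Show the raw bytes as hex so the encoding can be inspected directly.
echo bin2hex($decoded), "\n";

// Two bytes indicate UTF-8 output; one byte would indicate Latin-1.
echo strlen($decoded), " byte(s)\n";
```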


 [2003-09-27 06:11 UTC] nospam at unclassified dot de
OK, I found another way for my issue anyway...

<?php
// Returns the utf string corresponding to the unicode value (from php.net, courtesy - romans@void.lv)
function code2utf($num)
{
	if ($num < 128) return chr($num);
	if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
	if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
	if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
	return '';
}

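function encode($str)
{
	// Convert the Latin-1 input to UTF-8, then replace decimal numeric entities (&#NNN;).
	// preg_replace_callback() is used because the /e modifier was removed in PHP 7.
	return preg_replace_callback('/&#(\d+);/', function ($m) {
		return code2utf((int) $m[1]);
	}, utf8_encode($str));
}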
?>
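
To illustrate the bit arithmetic in code2utf(), here is a worked example of its three-byte branch, written out inline so it runs on its own. The code point 8776 (U+2248, "almost equal") is chosen for illustration only; the variable names are hypothetical.

```php
<?php
// Illustrative code point: U+2248 = 8776, which falls in the 2048..65535 range.
$num = 8776;

// Lead byte: top 4 bits of the code point, prefixed with 1110xxxx (224).
$b1 = ($num >> 12) + 224;
// Continuation bytes: 6 bits each, prefixed with 10xxxxxx (128).
$b2 = (($num >> 6) & 63) + 128;
$b3 = ($num & 63) + 128;

$bytes = chr($b1) . chr($b2) . chr($b3);

// Print the resulting byte sequence as hex.
echo bin2hex($bytes), "\n";
```

The same pattern extends to the two- and four-byte branches: each continuation byte carries six bits, and the lead byte carries whatever bits remain.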
 [2004-06-22 14:57 UTC] ross at golder dot org
Is there a particular reason you won't backport this fix to the 4.3 series? I can't find an explanation or discussion of this refusal. If somebody else backported this fix, would it be accepted? It sounds like a fairly important and useful bugfix to me.

Just curious. Currently downloading PHP5, will give it a spin.
 