PHP :: Bug #25670 :: cannot yet handle MBCS in html_entity

Bug #25670

cannot yet handle MBCS in html_entity_decode()

Submitted:

2003-09-26 09:24 UTC

Modified:

2003-09-27 06:11 UTC

Votes:	199
Avg. Score:	4.3 ± 1.0
Reproduced:	176 of 180 (97.8%)
Same Version:	71 (40.3%)
Same OS:	111 (63.1%)

From:

nospam at unclassified dot de

Assigned:

Status:

Wont fix

Package:

*General Issues

PHP Version:

4.3.2

OS:

Windows, Linux

Private report:

CVE-ID:

None


[2003-09-26 09:24 UTC] nospam at unclassified dot de

Description:
------------
Trying to decode HTML entities into UTF-8 results in the following error message:

Warning: cannot yet handle MBCS in html_entity_decode()!

The warning is repeated about 200 times, after which html_entity_decode() simply uses the ISO-8859-1 charset.

Reproduce code:
---------------
echo html_entity_decode("&uuml;", ENT_QUOTES, "UTF-8");

Expected result:
----------------
some UTF-8 encoding of 'ü'

Actual result:
--------------
the error messages shown above, then the Latin-1 encoding of 'ü'
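
For comparison, here is a minimal sketch of the expected result, assuming PHP 5 or later, where html_entity_decode() accepts UTF-8 as a target charset. It prints the decoded bytes in hex so the output does not depend on the terminal's encoding. The variable name is illustrative only.

```php
<?php
// Assumption: PHP 5 or later. On PHP 4.3.x this call emits the warning from the report.
$decoded = html_entity_decode('&uuml;', ENT_QUOTES, 'UTF-8');

// "ü" is U+00FC, which UTF-8 encodes as the two bytes C3 BC.
echo bin2hex($decoded), "\n";
```

A Latin-1 result would instead be the single byte FC.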


[2003-09-26 09:30 UTC] nospam at unclassified dot de

A slight correction: it does not use Latin-1, it simply does nothing. I tested with a Unicode character (&#x2248;, "almost equal") and it was returned readable.
This can be checked by passing the result of html_entity_decode() to htmlspecialchars() and then echoing it.
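
The check described above could be sketched roughly as follows. The variable names are illustrative. On PHP 5 or later, which is an assumption here, the entity is decoded. On an affected 4.3.x build the string would be expected to pass through unchanged.

```php
<?php
$input   = '&#x2248;';
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

// If nothing was decoded, the output is byte-for-byte identical to the input.
$unchanged = ($decoded === $input);

// htmlspecialchars() escapes any remaining "&", so an undecoded entity
// shows up literally as "&amp;#x2248;" instead of being rendered by the browser.
echo htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8'), "\n";
echo $unchanged ? "entity left as-is\n" : "decoded bytes: " . bin2hex($decoded) . "\n";
```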

[2003-09-27 00:07 UTC] moriyoshi@php.net

The very issue was already addressed and the appropriate fix is ready for php5, though we won't introduce this feature to the current stable version (4.3.x).

See:
http://cvs.php.net/diff.php/php-src/NEWS?r1=1.1403&r2=1.1404

[2003-09-27 06:11 UTC] nospam at unclassified dot de

OK, I found another way for my issue anyway...

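<?php
// Returns the UTF-8 string corresponding to a Unicode code point (from php.net, courtesy - romans@void.lv)
function code2utf($num)
{
	if ($num < 128) return chr($num);
	if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
	if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
	if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
	return '';
}

// Callback for preg_replace_callback(): $m[1] holds the decimal code point.
function code2utf_match($m)
{
	return code2utf((int) $m[1]);
}

// Converts a Latin-1 string to UTF-8, then decodes decimal numeric entities (&#NNN;).
function encode($str)
{
	// The /e modifier no longer works as of PHP 7; a named callback runs on old and new versions.
	return preg_replace_callback('/&#(\\d+);/', 'code2utf_match', utf8_encode($str));
}
?>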
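
The workaround above handles only decimal references such as &#252;. A variant that also accepts hexadecimal references such as &#x2248; might look like the sketch below. It assumes PHP 7 or later and input that is already UTF-8. The function names codepoint_to_utf8() and decode_numeric_entities() are hypothetical and do not come from the report or from PHP itself.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns '' for surrogates and values outside the Unicode range.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return '';
    }
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace decimal (&#252;) and hexadecimal (&#xFC;)
// character references. Named entities such as &uuml; are not touched.
function decode_numeric_entities(string $utf8): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));/i',
        function (array $m): string {
            // $m[2] is set only when the decimal branch matched.
            $cp = (isset($m[2]) && $m[2] !== '') ? (int) $m[2] : (int) hexdec($m[1]);
            $out = codepoint_to_utf8($cp);
            // Leave invalid references exactly as they were written.
            return $out === '' ? $m[0] : $out;
        },
        $utf8
    );
    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $utf8;
}

echo bin2hex(decode_numeric_entities('&#252;')), "\n";
```

Unlike the original, this sketch does not call utf8_encode(), so a Latin-1 string would have to be converted to UTF-8 before being passed in.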

[2004-06-22 14:57 UTC] ross at golder dot org

Is there a particular reason you won't backport this fix to the 4.3 series? I can't find an explanation or discussion of this refusal. If somebody else backported the fix, would it be accepted? It sounds like a fairly important and useful bugfix to me.

Just curious. Currently downloading PHP5, will give it a spin.
