PHP :: Bug #25670 :: cannot yet handle MBCS in html_entity

Bug #25670

cannot yet handle MBCS in html_entity_decode()

Submitted:

2003-09-26 09:24 UTC

Modified:

2003-09-27 06:11 UTC

Votes:	199
Avg. Score:	4.3 ± 1.0
Reproduced:	176 of 180 (97.8%)
Same Version:	71 (40.3%)
Same OS:	111 (63.1%)

From:

nospam at unclassified dot de

Assigned:

Status:

Wont fix

Package:

*General Issues

PHP Version:

4.3.2

OS:

Windows, Linux

Private report:

CVE-ID:

None


[2003-09-26 09:24 UTC] nospam at unclassified dot de

Description:
------------
Trying to decode HTML entities into UTF-8 results in the following error message:

Warning: cannot yet handle MBCS in html_entity_decode()!

The line is repeated about 200 times, then html_entity_decode just uses ISO-8859-1 charset.

Reproduce code:
---------------
echo html_entity_decode("&uuml;", ENT_QUOTES, "UTF-8");

Expected result:
----------------
some UTF-8 encoding of 'ü'

Actual result:
--------------
error messages as above, then Latin-1 encoding of 'ü'
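
For reference, a minimal sketch of what the expected result looks like at the byte level. It assumes PHP 5 or later, where html_entity_decode() supports multi-byte target charsets; on 4.3.x the call emits the warning quoted above instead. Dumping the bytes as hex avoids any ambiguity from the terminal or browser encoding.

```php
<?php
// Assumption: PHP 5+, where html_entity_decode() can produce UTF-8 output.
$decoded = html_entity_decode("&uuml;", ENT_QUOTES, "UTF-8");

// U+00FC (ü) is the two-byte sequence C3 BC in UTF-8;
// in ISO-8859-1 it would be the single byte FC.
echo bin2hex($decoded), "\n";
```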


[2003-09-26 09:30 UTC] nospam at unclassified dot de

A slight correction: it doesn't fall back to Latin-1, it just does nothing. I tested with a Unicode character (&#x2248; "almost equal") and it came back readable.
This can be tested by passing the result of html_entity_decode() to htmlspecialchars() and then echoing it.
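
A minimal sketch of that check, with the outcomes spelled out. If the entity was left untouched, htmlspecialchars() escapes its leading "&", so the two cases are easy to tell apart in the output. The comments describe what each case would print; which one you get depends on the PHP version.

```php
<?php
$decoded = html_entity_decode("&#x2248;", ENT_QUOTES, "UTF-8");
$escaped = htmlspecialchars($decoded, ENT_QUOTES, "UTF-8");

// Entity decoded:        the UTF-8 bytes E2 89 88 (the "almost equal" sign).
// Entity left untouched: the literal text "&amp;#x2248;".
echo $escaped, "\n";
```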

[2003-09-27 00:07 UTC] moriyoshi@php.net

This very issue has already been addressed and the fix is ready for PHP 5, but we won't introduce this feature into the current stable branch (4.3.x).

See:
http://cvs.php.net/diff.php/php-src/NEWS?r1=1.1403&r2=1.1404

[2003-09-27 06:11 UTC] nospam at unclassified dot de

OK, I found another way for my issue anyway...

<?php
// Returns the utf string corresponding to the unicode value (from php.net, courtesy - romans@void.lv)
function code2utf($num)
{
	if ($num < 128) return chr($num);
	if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
	if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
	if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
	return '';
}

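// Callback for preg_replace_callback(): converts one matched decimal reference to UTF-8.
function code2utf_callback($matches)
{
	return code2utf((int) $matches[1]);
}

function encode($str)
{
	// Uses a callback instead of the /e modifier, which was removed in PHP 7.
	return preg_replace_callback('/&#(\\d+);/', 'code2utf_callback', utf8_encode($str));
}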
?>
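
Note that the workaround above only matches decimal references such as &#252;, not hexadecimal ones such as &#x2248;. A rough sketch that covers both forms follows. All function names here are hypothetical and not part of the original report, and the sketch assumes the input string is already UTF-8 (so there is no utf8_encode() step). Named entities such as &uuml; are not handled.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    // Surrogates and values beyond U+10FFFF are not valid code points.
    if ($cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) return '';
    if ($cp < 0x80)    return chr($cp);
    if ($cp < 0x800)   return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    if ($cp < 0x10000) return chr(0xE0 | ($cp >> 12))
                            . chr(0x80 | (($cp >> 6) & 0x3F))
                            . chr(0x80 | ($cp & 0x3F));
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical callback: $m[1] is non-empty for hex references,
// otherwise $m[2] holds the decimal digits.
function numeric_ref_to_utf8($m)
{
    $cp = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
    return codepoint_to_utf8($cp);
}

// Replace every decimal or hexadecimal numeric reference in a UTF-8 string.
function decode_numeric_refs($utf8)
{
    return preg_replace_callback(
        '/&#(?:[xX]([0-9a-fA-F]+)|(\d+));/',
        'numeric_ref_to_utf8',
        $utf8
    );
}

echo bin2hex(decode_numeric_refs("&#252;")), "\n";
echo bin2hex(decode_numeric_refs("&#x2248;")), "\n";
```

The named callback (rather than a closure) keeps the sketch close to the style of the original workaround.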

[2004-06-22 14:57 UTC] ross at golder dot org

Is there a particular reason you won't backport this fix to the 4.3 series? I can't find an explanation or discussion of this refusal. If somebody else backported this fix, would it be accepted? Sounds like a fairly important and useful bugfix to me.

Just curious. Currently downloading PHP5, will give it a spin.
