PHP :: Request #65323 :: improvement for counting ill-formed byte sequences

Request #65323	improvement for counting ill-formed byte sequences
Submitted:	2013-07-24 10:59 UTC	Modified:	2013-10-15 11:54 UTC
From:	masakielastic at gmail dot com	Assigned:	cataphract (profile)
Status:	No Feedback	Package:	Strings related
PHP Version:	5.5.1	OS:
Private report:	No	CVE-ID:	None

View Developer Edit

Welcome! If you don't have a Git account, you can't do anything here.
If you reported this bug, you can edit this bug over here.

php.net Username: php.net Password:

Quick Fix:	(description)
	Block user comment
Status:		Assign to:
Package:
Bug Type:
Summary:
From:	masakielastic at gmail dot com
New email:
PHP Version:		OS:

New/Additional Comment:

[2013-07-24 10:59 UTC] masakielastic at gmail dot com

Description:
------------
Consider the number of substitute characters (U+FFFD)
when the range of UTF-8 string of second byte is narrow (such as 0xA0 - 0xBF)

//      Code Points   First Byte Second Byte Third Byte Fourth Byte
//   U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
//   U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
//  U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
// U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF

If you follow the recommended policy describled in "Table 3-8. Use of U+FFFD in 
UTF-8 Conversion" of The Unicode Standard,
"\xE0\x80" should be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD".
The actual result is "\xEF\xBF\xBD".

The one of solution for that purpose is introducing a macro that checks second 
byte by first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.p
atch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.p
hp

Test script:
---------------
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports

[2013-07-24 11:20 UTC] masakielastic at gmail dot com

Table 3-8. Use of U+FFFD in UTF-8 Conversion" of The Unicode Standard
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

[2013-07-26 00:43 UTC] yohgaki@php.net

Thank you for the report.
This seems good.

We are also discussing about mb_scrub() as mb_convert_encoding() alias. i.e. 
calling converter internally like mb_convert_encoding().

On master branch mbfl converter fix has been committed.
We appreciate if you could check the current implementation.

[2013-07-31 22:15 UTC] cataphract@php.net

-Assigned To: +Assigned To: cataphract

[2013-07-31 23:00 UTC] cataphract@php.net

-Status: Assigned +Status: Feedback

[2013-07-31 23:00 UTC] cataphract@php.net

Can you test this branch?

https://github.com/cataphract/php-src/compare/bug65323

I basically rewrote the parser; it was getting too complicated.

[2013-08-01 07:05 UTC] cataphract@php.net

Unfortunately, it's also significantly slower... I have to look more closely.

[2013-10-15 11:54 UTC] php-bugs at lists dot php dot net

No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Re-Opened". Thank you.

	php.net \| support \| documentation \| report a bug \| advanced search \| search howto \| statistics \| random bug \| login
go to bug id or search bugs for


Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Wed Jul 02 14:01:36 2025 UTC