PHP :: Bug #66121 :: UTF-8 lookbehinds match bytes instead of characters

Bug #66121	UTF-8 lookbehinds match bytes instead of characters
Submitted:	2013-11-20 00:01 UTC	Modified:	2015-06-05 12:15 UTC
From:	danielklein at airpost dot net	Assigned:	cmb (profile)
Status:	Duplicate	Package:	PCRE related
PHP Version:	5.5.6	OS:
Private report:	No	CVE-ID:	None

View Developer Edit

[2013-11-20 00:01 UTC] danielklein at airpost dot net

Description:
------------
The test script appears to check every byte position within the UTF-8 encoded character and backtrack until a valid starting byte is found, then check it. If you remove the improperly inserted characters the resulting bytes do encode the original character correctly.

Once it has matched the beginning of the string it should then move ahead by a character, not by a byte.

Test script:
---------------
<?php
// Sinhala characters
print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
// English characters
print(preg_replace('/(?<!k)/u', '*', 'k') . "\n"); // Works properly
print(preg_replace('/(?<!k)/u', '*', 'm') . "\n"); // Works properly
?>


Expected result:
----------------
*ක
*ම*
*k
*m*

Actual result:
--------------
*ක
*�*�*�*
*k
*m*

Patches

Pull Requests

History

AllCommentsChangesGit/SVN commitsRelated reports

[2013-11-20 03:07 UTC] rasmus@php.net

-Status: Open +Status: Feedback

[2013-11-20 03:07 UTC] rasmus@php.net

Please test this using the pcretest command line tool which is part of the PCRE package. If you can reproduce the issue there too, file the bug with the PCRE project. If you can't, we'll look into it further here.

[2013-11-20 23:31 UTC] danielklein at airpost dot net

-Status: Feedback +Status: Open

[2013-11-20 23:31 UTC] danielklein at airpost dot net

I originally thought it was a PCRE bug so I submitted one with them (see http://bugs.exim.org/show_bug.cgi?id=1416 for pcretest results). It has been confirmed to not be a PCRE bug so the bug must be in PHP.

[2013-11-22 04:14 UTC] danielklein at airpost dot net

The bug is also in preg_match_all(). See the following code:
<?php
preg_match_all('/(?<!ක)/u', 'ම', $matches, PREG_OFFSET_CAPTURE);
var_dump($matches);
?>

There should only be two matches, not four, (both empty strings) and the second one should have an offset of 3 (bytes).

[2014-12-15 11:19 UTC] nhahtdh at gmail dot com

The problem is most likely caused by advancing offset[1] (to which start_offset is assigned to and used in the next call to pcre_exec) by 1 byte, regardless of UTF-8 mode or not:

https://github.com/php/php-src/blob/bf59acdea75cf13d179f10ce89d296a30f38676d/ext/pcre/php_pcre.c#L1296

The code is in the branch where we found out that we are standing still due to zero-length match, and we can't find a non-zero-length match at the current position.

We need to check the actual mode and increment the offset by 1 UTF character if UTF-8 mode; otherwise, 1 byte increment as per usual.

[2014-12-16 03:28 UTC] danielklein at airpost dot net

Good find but it looks like that code is just for pcre_replace(). It needs to be fixed for all affected functions at the same time.

[2014-12-16 03:55 UTC] danielklein at airpost dot net

Same error seems to be on lines 838, 1296 & 1720. No idea if that would fix everything though.

[2014-12-16 11:01 UTC] nhahtdh at gmail dot com

In the "non-zero-width match and not reached limit" branch, if the startoffset is taken from the match result, then the offset is advanced correctly, except for the case of using \C (the behavior of PHP in such case is different from pcretest, but it should belong in a different bug report).

In the "zero-width match or limit == 0" branch, except for the case in question, we break out of the loop.

In the else branch, we also break out of the loop.

So I think everything should be fixed if we change this branch for all related functions.

[2015-01-08 22:24 UTC] cmbecker69 at gmx dot de

Rather interestingly, this seems to have worked before PHP
5.2.0: <http://3v4l.org/tL64Y>.

BTW: when talking about Unicode it might be best to avoid the term
"character". In this case it is about code points.

[2015-01-13 08:35 UTC] danielklein at airpost dot net

Although your test doesn't split the code point into bytes, it also does not give the correct result. There should be a second asterisk after the Sinhala letter.

[2015-06-05 12:15 UTC] cmb@php.net

-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb

[2015-06-05 12:15 UTC] cmb@php.net

As already been mentioned by nhahtdh at gmail dot com, this
problem has the same cause as bug #53823. Therefore I'm marking
this ticket as duplicate.

	php.net \| support \| documentation \| report a bug \| advanced search \| search howto \| statistics \| random bug \| login
go to bug id or search bugs for


Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Wed Mar 04 23:00:02 2026 UTC