PHP :: Bug #80360 :: Unicode character property \p{L} truncates match

Bug #80360: Unicode character property \p{L} truncates match
Submitted: 2020-11-13 13:34 UTC
Modified: 2020-11-13 13:39 UTC
From: benjamin dot morel at gmail dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 7.4.12
OS: Ubuntu 18
Private report: No
CVE-ID: None


[2020-11-13 13:34 UTC] benjamin dot morel at gmail dot com

Description:
------------
When \p{L} is used on accented letters, each match is cut off in the middle of a multibyte character, so the result is invalid UTF-8.

Examples:

à = C3A0 is returned as C3
é = C3A9 is returned as C3


Test script:
---------------
$str = 'Voilà déjà';

display($str);

preg_match_all('/\p{L}+/', $str, $matches);

foreach ($matches[0] as $match) {
    display($match);
}

function display($str) {
    echo bin2hex($str), ' ';
    var_export(mb_check_encoding($str, 'UTF-8'));
    echo PHP_EOL;
}

Expected result:
----------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a9 true
6ac3a0 true


Actual result:
--------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3 false
64c3 false
6ac3 false


[2020-11-13 13:37 UTC] benjamin dot morel at gmail dot com

Actually, "déjà" should be considered a single word with \p{L}, so the expected result should be:

566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a96ac3a0 true

[2020-11-13 13:39 UTC] nikic@php.net

-Status: Open
+Status: Not a bug

[2020-11-13 13:39 UTC] nikic@php.net

You are missing the /u modifier. Your input is not being treated as UTF-8.
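
To illustrate the point above, here is a minimal sketch that runs the report's pattern with and without the /u modifier. The helper name letterRuns() is hypothetical and not part of the original test script; the input is written with explicit byte escapes so the example does not depend on the encoding of the source file. Without /u, PCRE matches the subject one byte at a time, which is consistent with the truncated output in the report (the lead byte C3 is accepted as a letter, while the trailing bytes A0 and A9 are not).

```php
<?php
// Hypothetical helper: returns every run of letters found in $subject.
// When $utf8 is true, the /u modifier makes PCRE treat both the pattern
// and the subject as UTF-8 instead of as a sequence of single bytes.
function letterRuns(string $subject, bool $utf8): array
{
    $pattern = $utf8 ? '/\p{L}+/u' : '/\p{L}+/';
    preg_match_all($pattern, $subject, $matches);
    return $matches[0];
}

// "Voilà déjà" spelled out as UTF-8 bytes (à = C3 A0, é = C3 A9).
$str = "Voil\xC3\xA0 d\xC3\xA9j\xC3\xA0";

// Print the hex dump of each match for both modes.
foreach ([false, true] as $utf8) {
    echo $utf8 ? 'with /u:' : 'without /u:', PHP_EOL;
    foreach (letterRuns($str, $utf8) as $match) {
        echo '  ', bin2hex($match), PHP_EOL;
    }
}
```

With /u, the matches are the two whole words, 566f696cc3a0 and 64c3a96ac3a0, which agrees with the corrected expected result posted earlier in this thread.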

[2020-11-13 14:04 UTC] benjamin dot morel at gmail dot com

OMG, sorry about that. Thank you Nikita.



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Tue Jul 01 18:01:35 2025 UTC