Bug #80360 Unicode character property \p{L} truncates match
Submitted: 2020-11-13 13:34 UTC
Modified: 2020-11-13 13:39 UTC
From: benjamin dot morel at gmail dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 7.4.12
OS: Ubuntu 18
Private report: No
CVE-ID: None
 [2020-11-13 13:34 UTC] benjamin dot morel at gmail dot com
Description:
------------
When using \p{L} on accented letters, the resulting match is truncated and therefore invalid UTF-8.

Examples:

à = C3A0 is returned as C3
é = C3A9 is returned as C3


Test script:
---------------
<?php
$str = 'Voilà déjà';

display($str);

preg_match_all('/\p{L}+/', $str, $matches);

foreach ($matches[0] as $match) {
    display($match);
}

function display($str) {
    echo bin2hex($str), ' ';
    var_export(mb_check_encoding($str, 'UTF-8'));
    echo PHP_EOL;
}

Expected result:
----------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a9 true
6ac3a0 true


Actual result:
--------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3 false
64c3 false
6ac3 false


 [2020-11-13 13:37 UTC] benjamin dot morel at gmail dot com
Actually, "déjà" should be considered a single word with \p{L}, so the expected result should be:

566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a96ac3a0 true
 [2020-11-13 13:39 UTC] nikic@php.net
-Status: Open
+Status: Not a bug
 [2020-11-13 13:39 UTC] nikic@php.net
You are missing the /u modifier. Your input is not being treated as UTF-8.
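
For readers who want to see the difference, here is a minimal sketch that runs the reporter's pattern with and without the /u modifier. Without /u, PCRE reads the subject one byte at a time: 0xC3 on its own is the letter U+00C3, while 0xA0 and 0xA9 are a no-break space and a copyright sign, so each match stops halfway through the accented character. The variable names $bytewise and $utf8 are illustrative, and the subject is written with byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
// Same subject as in the report ('Voilà déjà'), spelled out as UTF-8 bytes.
$str = "Voil\xC3\xA0 d\xC3\xA9j\xC3\xA0";

// Without /u: every byte is treated as a separate character.
preg_match_all('/\p{L}+/', $str, $bytewise);

// With /u: pattern and subject are both treated as UTF-8.
preg_match_all('/\p{L}+/u', $str, $utf8);

// Print the matches of each run as hex, one run per line.
foreach ([$bytewise[0], $utf8[0]] as $matches) {
    echo implode(' ', array_map('bin2hex', $matches)), PHP_EOL;
}
// 566f696cc3 64c3 6ac3
// 566f696cc3a0 64c3a96ac3a0
```

The first output line reproduces the truncated matches from the report. The second line gives the two whole words from the reporter's corrected expectation.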
 [2020-11-13 14:04 UTC] benjamin dot morel at gmail dot com
OMG, sorry about that. Thank you Nikita.