Bug #80360 Unicode character property \p{L} truncates match
Submitted: 2020-11-13 13:34 UTC
Modified: 2020-11-13 13:39 UTC
From: benjamin dot morel at gmail dot com
Assigned:
Status: Not a bug
Package: PCRE related
PHP Version: 7.4.12
OS: Ubuntu 18
Private report: No
CVE-ID: None
 [2020-11-13 13:34 UTC] benjamin dot morel at gmail dot com
Description:
------------
When using \p{L} on accented letters, the resulting match is truncated and therefore invalid UTF-8.

Examples:

à = C3A0 is returned as C3
é = C3A9 is returned as C3


Test script:
---------------
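<?php
// UTF-8 input containing accented letters.
$str = 'Voilà déjà';

// Dump the full input first.
display($str);

// Note: the pattern carries no modifiers.
preg_match_all('/\p{L}+/', $str, $matches);

foreach ($matches[0] as $match) {
    display($match);
}

// Print the bytes of $str as hex, followed by whether they form valid UTF-8.
function display(string $str): void {
    echo bin2hex($str), ' ';
    var_export(mb_check_encoding($str, 'UTF-8'));
    echo PHP_EOL;
}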

Expected result:
----------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a9 true
6ac3a0 true


Actual result:
--------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3 false
64c3 false
6ac3 false


History

 [2020-11-13 13:37 UTC] benjamin dot morel at gmail dot com
Actually, "déjà" should be considered a single word with \p{L}, so the expected result should be:

566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a96ac3a0 true
 [2020-11-13 13:39 UTC] nikic@php.net
-Status: Open
+Status: Not a bug
 [2020-11-13 13:39 UTC] nikic@php.net
You are missing the /u modifier. Your input is not being treated as UTF-8.
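
For illustration, here is a minimal sketch of the same test with and without the /u modifier. The variable names $bytewise and $utf8 are hypothetical and not part of the original report, the input is written with byte escapes so the result does not depend on how the script file is saved, and the mbstring extension is assumed to be loaded, as in the original script.

```php
<?php
// Same text as in the report ("Voilà déjà"), spelled out as UTF-8 bytes.
$str = "Voil\xC3\xA0 d\xC3\xA9j\xC3\xA0";

// Without /u the subject is matched one byte at a time. The byte 0xC3 counts
// as a letter, while 0xA0 and 0xA9 do not, so each match stops in the middle
// of a two-byte character.
preg_match_all('/\p{L}+/', $str, $bytewise);

// With /u the pattern and the subject are both treated as UTF-8, so \p{L}
// is tested against whole code points.
preg_match_all('/\p{L}+/u', $str, $utf8);

foreach ($utf8[0] as $match) {
    echo bin2hex($match), ' ';
    var_export(mb_check_encoding($match, 'UTF-8'));
    echo PHP_EOL;
}
// Prints:
// 566f696cc3a0 true
// 64c3a96ac3a0 true
```

With /u the output agrees with the corrected expected result from the 13:37 comment, while $bytewise still holds the truncated matches from the original "Actual result".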
 [2020-11-13 14:04 UTC] benjamin dot morel at gmail dot com
OMG, sorry about that. Thank you Nikita.
Last updated: Wed Jan 15 13:01:29 2025 UTC