  [2020-11-13 13:34 UTC] benjamin dot morel at gmail dot com
 Description:
------------
When using \p{L} on accented letters, the resulting match is truncated and therefore invalid UTF-8.
Examples:
à = C3A0 is returned as C3
é = C3A9 is returned as C3
Test script:
---------------
$str = 'Voilà déjà';
display($str);
preg_match_all('/\p{L}+/', $str, $matches);
foreach ($matches[0] as $match) {
    display($match);
}
function display($str) {
    echo bin2hex($str), ' ';
    var_export(mb_check_encoding($str, 'UTF-8'));
    echo PHP_EOL;
}
Expected result:
----------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a9 true
6ac3a0 true
Actual result:
--------------
566f696cc3a02064c3a96ac3a0 true
566f696cc3 false
64c3 false
6ac3 false
Actually, "déjà" should be considered a single word with \p{L}, so the expected result should be:
566f696cc3a02064c3a96ac3a0 true
566f696cc3a0 true
64c3a96ac3a0 true
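A possible explanation, which the report itself does not state: the pattern has no u modifier, so PCRE processes the subject byte by byte rather than as UTF-8. In that mode the lead byte C3 is treated as a letter on its own, while the continuation bytes A0 and A9 are not, which would produce exactly the truncated matches shown above. The sketch below compares both modes. letter_runs() is a hypothetical helper introduced only for this illustration, and the script assumes the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of the original report): returns every run
// of letters in $str, with or without the u (UTF-8) pattern modifier.
function letter_runs(string $str, bool $utf8): array {
    $pattern = $utf8 ? '/\p{L}+/u' : '/\p{L}+/';
    preg_match_all($pattern, $str, $matches);
    return $matches[0];
}

$str = 'Voilà déjà';
foreach ([false, true] as $utf8) {
    echo $utf8 ? 'with /u:' : 'without /u:', PHP_EOL;
    foreach (letter_runs($str, $utf8) as $match) {
        // Same display format as the test script: hex bytes, then UTF-8 validity.
        echo bin2hex($match), ' ';
        var_export(mb_check_encoding($match, 'UTF-8'));
        echo PHP_EOL;
    }
}
```

With the u modifier the matches should be 566f696cc3a0 and 64c3a96ac3a0, both valid UTF-8, which agrees with the corrected expected result.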