Doc Bug #72353 Misleading quote concerning PCRE in "UTF-8 mode"
Submitted: 2016-06-07 08:37 UTC
Modified: 2016-06-07 14:28 UTC
From: post at oliver-schieche dot de
Assigned: cmb
Status: Closed
Package: PCRE related
PHP Version: 5.6.22
OS: Any
Private report: No
CVE-ID: None

 [2016-06-07 08:37 UTC] post at oliver-schieche dot de
The manual section "Character classes" of the PCRE documentation contains a quote from the original PCRE documentation: "In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

While this is true for the original PCRE library, it is not true of the PCRE implementation in PHP. The `u` modifier in PHP actually means PCRE_UTF8 | PCRE_UCP; the latter makes character classes like [[:digit:]] also match, e.g., Persian digits, which are certainly above code point 128.

This misleading quote should be removed from the manual and possibly be replaced with a note/warning that character classes are changed when the `u` modifier is used. From the original PCRE docs:

However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}
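
To make the substitution concrete, here is a minimal sketch comparing [[:digit:]] with and without the `u` modifier, and against \p{Nd}. It assumes PHP >= 5.3 linked against libpcre >= 8.10 (so that `u` implies PCRE_UCP) and the default "C" locale; the variable names are illustrative only.

```php
<?php
// Assumption: PHP >= 5.3 with libpcre >= 8.10, so that `u` implies PCRE_UCP.
$string = 'I have ۳ apples and 5 oranges.';

// Without `u`, the subject is treated as bytes and [[:digit:]] only
// matches ASCII digits (in the default "C" locale).
preg_match_all('/[[:digit:]]+/', $string, $bytes);

// With `u`, [[:digit:]] is replaced by \p{Nd}, so both patterns below
// should find the same matches.
preg_match_all('/[[:digit:]]+/u', $string, $posix);
preg_match_all('/\p{Nd}+/u', $string, $property);

var_export(array($bytes[0], $posix[0], $property[0]));
```

On a build that meets these assumptions, the second and third result lists should be identical and include the Persian digit, while the first should not.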

A little more context and the answer inspiring this bug-report on Stackoverflow:

Test script:
<?php
$string = 'I have ۳ apples and 5 oranges.';

preg_match_all('/\d+/u', $string, $m);
var_export($m);

Expected result:
// According to the docs, this should be dumped:

array (
  0 => 
  array (
    0 => '5',
  ),
)

Actual result:
array (
  0 => 
  array (
    0 => '۳',
    1 => '5',
  ),
)
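
If only ASCII digits are wanted while keeping the `u` modifier, an explicit range is one possible approach. This is a sketch rather than a recommendation from the manual; it reuses the sample string from the test script, and the variable names are made up for illustration.

```php
<?php
$string = 'I have ۳ apples and 5 oranges.';

// [0-9] is an explicit range, so it is not replaced by a Unicode
// property the way \d and [[:digit:]] are under PCRE_UCP.
preg_match_all('/[0-9]+/u', $string, $ascii);

// \d with `u`, for comparison with the actual result reported above.
preg_match_all('/\d+/u', $string, $unicode);

var_export($ascii[0]);
```

The explicit range keeps UTF-8 handling for the rest of the pattern while leaving the digit match unaffected by PCRE_UCP.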


 [2016-06-07 10:18 UTC]
-Status: Open
+Status: Verified
-Assigned To:
+Assigned To: cmb
 [2016-06-07 10:18 UTC]
Thanks for reporting this issue.

> The `u` modifier in PHP actually means PCRE_UTF8 | PCRE_UCP; […]

Note that this is true only as of PHP 5.3[1] and libpcre 8.10[2],
though. In particular, the latter is not guaranteed even with recent PHP
versions; PHP 7.0.7, for instance, still supports libpcre 6.6 or

[1] <>
[2] <>
[3] <>
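
Whether a given installation meets these version requirements can be checked at runtime. The following is a rough sketch using the PCRE_VERSION and PHP_VERSION constants; the variable names are illustrative, and the check only looks at version numbers, not at whether libpcre was actually built with Unicode property support.

```php
<?php
// PCRE_VERSION is a string such as "8.38 2015-11-23"; keep only the
// leading version number.
$parts = explode(' ', PCRE_VERSION);
$pcreVersion = $parts[0];

// 8.10 and 5.3 are the versions named in the comment above; this is a
// version check only and says nothing about libpcre build options.
$hasUcp = version_compare($pcreVersion, '8.10', '>=')
    && version_compare(PHP_VERSION, '5.3.0', '>=');

echo "libpcre $pcreVersion, PCRE_UCP expected with `u`: ",
    $hasUcp ? 'yes' : 'no', "\n";
```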
 [2016-06-07 14:28 UTC]
Automatic comment from SVN on behalf of cmb
Log: Fix #72353: Misleading quote concerning PCRE in "UTF-8 mode"
 [2016-06-07 14:28 UTC]
-Status: Verified
+Status: Closed
 [2016-06-07 14:28 UTC]
This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.
 [2020-02-07 06:07 UTC]
Automatic comment on behalf of cmb
Log: Fix #72353: Misleading quote concerning PCRE in "UTF-8 mode"