Doc Bug #72353 Misleading quote concerning PCRE in "UTF-8 mode"
Submitted: 2016-06-07 08:37 UTC Modified: 2016-06-07 14:28 UTC
From: post at oliver-schieche dot de Assigned: cmb (profile)
Status: Closed Package: PCRE related
PHP Version: 5.6.22 OS: Any
Private report: No CVE-ID: None
 [2016-06-07 08:37 UTC] post at oliver-schieche dot de
Description:
------------
The manual section "Character classes" of the PCRE documentation contains a quote from the original PCRE documentation: "In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

While this is true for the original PCRE library, it is not true for PHP's PCRE implementation. The `u` modifier in PHP actually means PCRE_UTF8 | PCRE_UCP; the latter makes character classes like [[:digit:]] also match, for example, Persian digits, which certainly are above code point 128.

This misleading quote should be removed from the manual and possibly replaced with a note/warning that character classes change when the `u` modifier is used. From the original PCRE docs (http://www.pcre.org/original/doc/html/pcrepattern.html#SEC10):

However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}
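
The replacement table can be checked empirically. The following is a minimal sketch, assuming a PHP build in which the `u` modifier implies PCRE_UCP; the helper name `matches` and the sample string are made up for illustration and are not part of the PCRE docs.

```php
<?php
// Hypothetical helper: return all full matches of $pattern in $subject.
function matches($pattern, $subject)
{
    preg_match_all($pattern, $subject, $m);
    return $m[0];
}

$subject = 'I have ۳ apples and 5 oranges.';

// Without `u`, the subject is treated as bytes and [[:digit:]] is ASCII-only.
$bytes = matches('/[[:digit:]]+/', $subject);

// With `u` (assumed to enable PCRE_UCP), [[:digit:]] should behave like \p{Nd}.
$posix = matches('/[[:digit:]]+/u', $subject);
$prop  = matches('/\p{Nd}+/u', $subject);

var_export($bytes);
echo "\n";
var_export($posix === $prop);
echo "\n";
```

If the last line prints `false`, the build most likely does not enable PCRE_UCP for the `u` modifier.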

A little more context, and the answer that inspired this bug report, can be found on Stack Overflow: http://stackoverflow.com/q/37658139/255756

Test script:
---------------
<?php
$string = 'I have ۳ apples and 5 oranges.';

// \d with the `u` modifier: does it match only ASCII digits?
preg_match_all('/\d+/u', $string, $m);
var_export($m);


Expected result:
----------------
// According to the docs, this should be dumped

array (
  0 => 
  array (
    0 => '5',
  ),
)

Actual result:
--------------
array (
  0 => 
  array (
    0 => '۳',
    1 => '5',
  ),
)
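
If only ASCII digits are wanted while keeping the `u` modifier, one option is an explicit range instead of `\d`. This is a sketch of a possible workaround, not something the manual or this report prescribes:

```php
<?php
$string = 'I have ۳ apples and 5 oranges.';

// [0-9] is a literal range of code points, so Unicode properties do not widen it.
preg_match_all('/[0-9]+/u', $string, $m);
var_export($m);
```

With this pattern, only `'5'` should be captured, regardless of whether PCRE_UCP is in effect.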


 [2016-06-07 10:18 UTC] cmb@php.net
-Status: Open +Status: Verified -Assigned To: +Assigned To: cmb
 [2016-06-07 10:18 UTC] cmb@php.net
Thanks for reporting this issue.

> The `u` modifier in PHP actually means PCRE_UTF8 | PCRE_UCP; […]

Note that this is true only as of PHP 5.3[1] and libpcre 8.10[2],
though. The latter in particular is not guaranteed even with recent
PHP versions; PHP 7.0.7, for instance, still supports libpcre 6.6 or
higher[3].

[1] <https://github.com/php/php-src/blob/php-7.0.7/ext/pcre/php_pcre.c#L430-L432>
[2] <http://www.pcre.org/original/changelog.txt>
[3] <https://github.com/php/php-src/blob/PHP-7.0.7/ext/pcre/config0.m4#L42-L44>
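
Since the behaviour depends on both the PHP and libpcre versions, it may be safer to probe for it at runtime. The following is a rough sketch, assuming the thresholds named above (PHP 5.3 and libpcre 8.10); the variable names are illustrative only.

```php
<?php
// PCRE_VERSION looks like "8.38 2015-11-23"; keep only the version number.
$pcreVersion = strtok(PCRE_VERSION, ' ');

// Version-based guess, using the thresholds mentioned in the comment above.
$expectUcp = version_compare(PHP_VERSION, '5.3.0', '>=')
    && version_compare($pcreVersion, '8.10', '>=');

// Behavioural probe: does \d match a Persian digit under the `u` modifier?
$hasUcp = preg_match('/^\d$/u', '۳') === 1;

printf(
    "PCRE %s, UCP expected: %s, UCP observed: %s\n",
    $pcreVersion,
    var_export($expectUcp, true),
    var_export($hasUcp, true)
);
```

The behavioural probe is the more reliable of the two checks, because it tests what the linked library actually does rather than inferring it from version numbers.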
 [2016-06-07 14:28 UTC] cmb@php.net
Automatic comment from SVN on behalf of cmb
Revision: http://svn.php.net/viewvc/?view=revision&revision=339309
Log: Fix #72353: Misleading quote concerning PCRE in "UTF-8 mode"
 [2016-06-07 14:28 UTC] cmb@php.net
-Status: Verified +Status: Closed
 [2016-06-07 14:28 UTC] cmb@php.net
This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.
 