PHP :: Doc Bug #72353 :: Misleading quote concerning PCRE in "UTF-8 mode"

Doc Bug #72353	Misleading quote concerning PCRE in "UTF-8 mode"
Submitted:	2016-06-07 08:37 UTC	Modified:	2016-06-07 14:28 UTC
From:	post at oliver-schieche dot de	Assigned:	cmb (profile)
Status:	Closed	Package:	PCRE related
PHP Version:	5.6.22	OS:	Any
Private report:	No	CVE-ID:	None

[2016-06-07 08:37 UTC] post at oliver-schieche dot de

Description:
------------
The manual section "Character classes" of the PCRE documentation contains a quote from the original PCRE documentation: "In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

While this is true for the original PCRE library by default, it does not hold for PHP's PCRE extension. The `u` modifier in PHP actually means PCRE_UTF8 | PCRE_UCP; the latter makes character classes like [[:digit:]] also match, for example, Persian digits, which certainly lie above code point 128.

This misleading quote should be removed from the manual and possibly replaced with a note/warning that the character classes change when the `u` modifier is used. From the original PCRE docs (http://www.pcre.org/original/doc/html/pcrepattern.html#SEC10):

However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}
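
The effect of this substitution can be observed from PHP with a minimal sketch like the following. It assumes a build where the `u` modifier also enables PCRE_UCP, as described in this report; the helper name digitRuns() is hypothetical and exists only for illustration.

```php
<?php
// Hypothetical helper for illustration: returns every run of characters
// that the given pattern matches in $subject.
function digitRuns(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m);
    return $m[0];
}

$string = 'I have ۳ apples and 5 oranges.'; // ۳ is U+06F3

// Without `u`, the subject is treated as bytes and [:digit:] stays ASCII-only.
$bytes = digitRuns('/[[:digit:]]+/', $string);

// With `u` (assumed here to imply PCRE_UCP), [:digit:] behaves like \p{Nd}.
$unicode = digitRuns('/[[:digit:]]+/u', $string);

// The explicit Unicode property, for comparison.
$explicit = digitRuns('/\p{Nd}+/u', $string);

var_export([$bytes, $unicode, $explicit]);
```

On such a build, the second and third patterns should return the same matches, while the first finds only the ASCII digit.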

A little more context, and the answer that inspired this bug report, can be found on Stack Overflow: http://stackoverflow.com/q/37658139/255756

Test script:
---------------
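<?php
// The first digit is U+06F3 (EXTENDED ARABIC-INDIC DIGIT THREE), the second is ASCII.
$string = 'I have ۳ apples and 5 oranges.';

// With the `u` modifier, \d is matched using Unicode properties.
preg_match_all('/\d+/u', $string, $m);
var_export($m);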


Expected result:
----------------
// According to the docs, this should be dumped

array (
  0 => 
  array (
    0 => '5',
  ),
)

Actual result:
--------------
array (
  0 => 
  array (
    0 => '۳',
    1 => '5',
  ),
)
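
If only ASCII digits are wanted while keeping the `u` modifier, one option is an explicit range, which lists the accepted characters directly and is therefore unaffected by PCRE_UCP. A minimal sketch using the string from the test script (variable names are illustrative):

```php
<?php
$string = 'I have ۳ apples and 5 oranges.';

// \d under `u` may match any Unicode decimal digit when PCRE_UCP is active.
preg_match_all('/\d+/u', $string, $any);

// An explicit range names the accepted characters, so it stays ASCII-only.
preg_match_all('/[0-9]+/u', $string, $ascii);

var_export($ascii[0]);
```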

[2016-06-07 10:18 UTC] cmb@php.net

-Status: Open
+Status: Verified
-Assigned To:
+Assigned To: cmb

[2016-06-07 10:18 UTC] cmb@php.net

Thanks for reporting this issue.

> The `u` modifier in PHP actually means PCRE_UTF8 | PCRE_UCP; […]

Note that this is true only as of PHP 5.3[1] and libpcre 8.10[2],
though. The latter in particular is not guaranteed even with recent PHP
versions; PHP 7.0.7, for instance, still supports libpcre 6.6 or
higher[3].

[1] <https://github.com/php/php-src/blob/php-7.0.7/ext/pcre/php_pcre.c#L430-L432>
[2] <http://www.pcre.org/original/changelog.txt>
[3] <https://github.com/php/php-src/blob/PHP-7.0.7/ext/pcre/config0.m4#L42-L44>
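
Whether a given installation is affected can be checked at runtime. The following is a rough sketch, not an official check: it reads the PCRE_VERSION constant exposed by the pcre extension and, as a more direct probe, tests whether `\d` under `u` matches a non-ASCII digit. The 8.10 threshold is taken from the comment above; the variable names are illustrative.

```php
<?php
// PCRE_VERSION looks like "8.38 2015-11-23"; keep only the version number.
$pcreVersion = explode(' ', PCRE_VERSION)[0];

// Threshold taken from the comment above: PCRE_UCP exists as of libpcre 8.10.
$libraryHasUcp = version_compare($pcreVersion, '8.10', '>=');

// Behavioural probe: does \d under `u` accept U+06F3, a non-ASCII digit?
$ucpActive = preg_match('/^\d$/u', '۳') === 1;

printf(
    "libpcre %s, UCP available: %s, UCP active with /u: %s\n",
    $pcreVersion,
    var_export($libraryHasUcp, true),
    var_export($ucpActive, true)
);
```

The behavioural probe is the more reliable of the two, since it reflects what the linked library actually does rather than what its version number implies.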

[2016-06-07 14:28 UTC] cmb@php.net

Automatic comment from SVN on behalf of cmb
Revision: http://svn.php.net/viewvc/?view=revision&revision=339309
Log: Fix #72353: Misleading quote concerning PCRE in "UTF-8 mode"

[2016-06-07 14:28 UTC] cmb@php.net

-Status: Verified
+Status: Closed

[2016-06-07 14:28 UTC] cmb@php.net

This bug has been fixed in the documentation's XML sources. Since the
online and downloadable versions of the documentation need some time
to get updated, we would like to ask you to be a bit patient.

Thank you for the report, and for helping us make our documentation better.

[2016-06-22 16:03 UTC] cmb@php.net

Related To: Bug #67041

[2020-02-07 06:07 UTC] phpdocbot@php.net

Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=doc/en.git;a=commit;h=f41fee15b5968f724773b7eec1692734a27415b5
Log: Fix #72353: Misleading quote concerning PCRE in "UTF-8 mode"
