Doc Bug #78245 preg_split('/\R/', 'Техни') wrongly splits text into 2 parts
Submitted: 2019-07-03 13:50 UTC
Modified: 2019-07-03 13:59 UTC
From: bugs_php at zarevak dot net
Assigned:
Status: Verified
Package: PCRE related
PHP Version: 7.3.6
OS: any (tested Windows, Linux)
Private report: No
CVE-ID: None

 [2019-07-03 13:50 UTC] bugs_php at zarevak dot net
Description:
------------
According to https://www.php.net/manual/en/regexp.reference.escape.php, \R should match line breaks (\n, \r and \r\n), but here it matches something else and splits the string in two. Using the expression \r\n|\n|\r directly works correctly.

Adding the 'u' pattern modifier fixes the issue, but it also checks the validity of the incoming UTF-8 string, which I do not want. The Russian text 'Техни' in this example does not contain \r or \n characters, even when encoded as UTF-8, so \R should not match anything in it. Instead, \R splits the string in the middle of the 'х' (U+0445 CYRILLIC SMALL LETTER HA) character, which is encoded in UTF-8 as the bytes 0xD1 0x85.

This is either a bug:
1] in the implementation of \R, which incorrectly matches bytes other than 13, 10 and their combination
2] or in the documentation, which should state what other characters \R can match (in this example it seems to match U+0085 <control> : NEXT LINE [NEL])
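
The byte-level cause described above can be checked directly. This is only a minimal sketch: it assumes the source file is saved as UTF-8, and the variable names are illustrative.

```php
<?php
// 'х' (U+0445) is encoded in UTF-8 as the two bytes 0xD1 0x85.
$hex = bin2hex('х');

// Without the 'u' modifier the subject is matched byte by byte,
// so \R can match the lone 0x85 byte inside 'х'.
$parts = preg_split('/\R/', 'Техни');
$partsHex = array_map('bin2hex', $parts);

var_dump($hex);      // string(4) "d185"
var_dump($partsHex); // "d0a2d0b5d1" and "d0bdd0b8": the 0x85 byte was consumed as the separator
```

The first part ends with a dangling 0xD1 byte, which is why the resulting strings are no longer valid UTF-8.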

Test script:
---------------
<?php
$string="Реконструкция\r\nРеконструкция - Служба Технической поддержки\r\n";
$array = preg_split('/\R/', $string);  // BUG!
$array2 = preg_split('/\r\n|\n|\r/', $string); //OK :)
var_dump($array, $array2, ($array===$array2));

Expected result:
----------------
array(3) {
  [0]=>
  string(26) "Реконструкция"
  [1]=>
  string(83) "Реконструкция - Служба Технической поддержки"
  [2]=>
  string(0) ""
}
array(3) {
  [0]=>
  string(26) "Реконструкция"
  [1]=>
  string(83) "Реконструкция - Служба Технической поддержки"
  [2]=>
  string(0) ""
}
bool(true)

Actual result:
--------------
array(4) {
  [0]=>
  string(26) "Реконструкция"
  [1]=>
  string(47) "Реконструкция - Служба Те"
  [2]=>
  string(35) "нической поддержки"
  [3]=>
  string(0) ""
}
array(3) {
  [0]=>
  string(26) "Реконструкция"
  [1]=>
  string(83) "Реконструкция - Служба Технической поддержки"
  [2]=>
  string(0) ""
}
bool(false)

History

 [2019-07-03 13:54 UTC] nikic@php.net
-Type: Bug +Type: Documentation Problem
 [2019-07-03 13:54 UTC] nikic@php.net
\R matches any Unicode line break. To only match \r, \n and \r\n the (*BSR_ANYCRLF) mode needs to be used.
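
A minimal sketch of that suggestion, reusing the test string from the report (the variable names are illustrative, and the file is assumed to be saved as UTF-8). The (*BSR_ANYCRLF) sequence has to appear at the very start of the pattern:

```php
<?php
$string = "Реконструкция\r\nРеконструкция - Служба Технической поддержки\r\n";

// (*BSR_ANYCRLF) restricts \R to CR, LF, or CRLF for this pattern only.
$restricted = preg_split('/(*BSR_ANYCRLF)\R/', $string);

// The explicit alternation from the original report, for comparison.
$explicit = preg_split('/\r\n|\n|\r/', $string);

var_dump(count($restricted));       // int(3)
var_dump($restricted === $explicit); // bool(true)
```

Unlike the 'u' modifier, this approach should not require the subject to be valid UTF-8, since the pattern is still matched in 8-bit mode.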
 [2019-07-03 13:59 UTC] cmb@php.net
-Status: Open +Status: Verified
 [2019-07-03 13:59 UTC] cmb@php.net
From the PCRE2 docs[1]:

| In 8-bit non-UTF-8 mode \R is equivalent to the following:
|
|  (?>\r\n|\n|\x0b|\f|\r|\x85)

[1] <https://www.pcre.org/current/doc/html/pcre2pattern.html#newlineseq>
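
The equivalence quoted above can be probed one alternative at a time. This is an illustrative sketch, not part of the PCRE2 documentation; the labels in `$candidates` are made up for readability.

```php
<?php
$candidates = [
    'CRLF'     => "\r\n",
    'LF'       => "\n",
    'VT'       => "\x0b",
    'FF'       => "\f",
    'CR'       => "\r",
    'NEL byte' => "\x85",
    'space'    => ' ',   // control case, not a line break
];

$results = [];
foreach ($candidates as $name => $bytes) {
    // \A and \z anchor to the absolute start and end of the subject,
    // so the whole subject must be consumed by a single \R.
    $results[$name] = preg_match('/\A\R\z/', $bytes);
}

var_dump($results); // every entry is int(1) except 'space', which is int(0)
```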
 [2019-07-03 14:02 UTC] danack@php.net
For anyone looking to update the manual, this section from the PCRE docs is probably relevant, and non-trivial: https://www.pcre.org/original/doc/html/pcrepattern.html#newlineseq

"Newline sequences 

Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given below. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next line, U+0085). The two-character sequence is treated as a single unit that cannot be split.
In other modes, two additional characters whose codepoints are greater than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). Unicode character property support is not needed for these characters to be recognized.

It is possible to restrict \R to match only CR, LF, or CRLF (instead of the complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. (BSR is an abbreviation for "backslash R".) This can be made the default when PCRE is built; if this is the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option. It is also possible to specify these settings by starting a pattern string with one of the following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to the compiling function, but they can themselves be overridden by options given to a matching function. Note that these special settings, which are not Perl-compatible, are recognized only at the very start of a pattern, and that they must be in upper case. If more than one of them is present, the last one is used. They can be combined with a change of newline convention; for example, a pattern can start with:

  (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or (*UCP) special sequences. Inside a character class, \R is treated as an unrecognized escape sequence, and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is set."
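
To illustrate the difference between 8-bit and UTF-8 mode described in the quoted section, here is a small sketch. It assumes PHP 7.0 or later for the "\u{...}" string escape and a source file saved as UTF-8; the variable names are illustrative.

```php
<?php
// LS (U+2028) and PS (U+2029), written with PHP's Unicode escape.
$subject = "a\u{2028}b\u{2029}c";

// 8-bit mode: the UTF-8 bytes of LS and PS (E2 80 A8 / E2 80 A9)
// contain none of the bytes that \R matches, so nothing is split.
$byteMode = preg_split('/\R/', $subject);

// UTF-8 mode ('u' modifier): LS and PS are recognized as newline sequences.
$utfMode = preg_split('/\R/u', $subject);

// In UTF-8 mode the 0x85 byte inside 'х' is part of a character, not a NEL.
$cyrillic = preg_split('/\R/u', 'Техни');

var_dump(count($byteMode));          // int(1)
var_dump($utfMode);                  // "a", "b", "c"
var_dump($cyrillic === ['Техни']);   // bool(true)
```

The 'u' modifier also makes PCRE validate the subject as UTF-8, which is the side effect the reporter wanted to avoid.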
 
PHP Copyright © 2001-2019 The PHP Group
All rights reserved.
Last updated: Tue Dec 10 02:01:24 2019 UTC