Bug #52971 PCRE-Meta-Characters not working with utf-8
Submitted: 2010-10-02 17:58 UTC
Modified: 2010-10-04 22:11 UTC
From: marc dot bennewitz at giata dot de
Assigned: felipe
Status: Closed
Package: PCRE related
PHP Version: 5.3.3
OS: Linux
Private report: No
CVE-ID: None
 [2010-10-02 17:58 UTC] marc dot bennewitz at giata dot de
Description:
------------
PCRE meta-characters such as \b and \w do not work with Unicode (UTF-8) strings.

PHP-5.3.3 (32Bit)
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000

iconv

iconv support => enabled
iconv implementation => glibc
iconv library version => 2.10.1

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1


Test script:
---------------
<?php // encoding: UTF-8

$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

$pattern = '/\bwasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

$pattern = '/[^\w]wasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

Expected result:
----------------
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

Actual result:
--------------
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "wasser"
      [1]=>
      int(17)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(8) "ßwasser"
      [1]=>
      int(15)
    }
    [1]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

History

 [2010-10-02 20:26 UTC] cataphract@php.net
-Status: Open +Status: Bogus
 [2010-10-02 20:26 UTC] cataphract@php.net
This is by design; it is how \b and \w are defined in PCRE.

You'll have to use another strategy, such as a lookbehind combined with Unicode character properties.
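
A minimal sketch of the lookbehind strategy suggested above, using the message from the test script. The helper name findWordStarts() and the character class [\p{L}\p{N}_] (letters, numbers, underscore as a stand-in for \w) are assumptions made for illustration, not something prescribed in this report. The negative lookbehind rejects a match when the preceding character is a Unicode letter or number, which approximates \b at the start of the needle.

```php
<?php
// Hypothetical helper: find $needle only where it is NOT preceded by a
// Unicode letter, a Unicode number, or an underscore.
function findWordStarts(string $needle, string $subject): array
{
    // (?<!...) is a negative lookbehind; \p{L} and \p{N} are Unicode
    // character properties, which require the "u" (UTF-8) modifier.
    $pattern = '/(?<![\p{L}\p{N}_])' . preg_quote($needle, '/') . '/iu';
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);

    // Each entry is [matched text, byte offset].
    return $m[0];
}

$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';
$hits = findWordStarts('wasser', $message);
var_dump($hits);
```

Because the check relies only on \p{...} properties, it should not depend on how \w and \b themselves are defined. The script file must be saved as UTF-8, as in the original test script.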
 [2010-10-03 10:21 UTC] marc dot bennewitz at giata dot de
There are some problems with that:
1. On Windows it works as expected.
2. Unicode properties offer no word-boundary equivalent (\w, \W).
3. With the "u" modifier, PHP knows that the subject is UTF-8.
4. http://php.net/manual/regexp.reference.escape.php has no note about a UTF-8 incompatibility.

php.exe -i
...
iconv

iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.11

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000
...
 [2010-10-03 11:02 UTC] cataphract@php.net
-Status: Bogus +Status: Re-Opened
 [2010-10-03 11:02 UTC] cataphract@php.net
I'm reopening, as there is indeed a different behavior on Windows that I can't quite explain yet.
 [2010-10-03 18:01 UTC] felipe@php.net
Automatic comment from SVN on behalf of felipe
Revision: http://svn.php.net/viewvc/?view=revision&revision=303963
Log: - Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8)
#   In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
#   characters, even in UTF-8 mode. However, this can be changed by setting
#   the PCRE_UCP option.
 [2010-10-03 18:02 UTC] felipe@php.net
-Status: Re-Opened +Status: Closed -Assigned To: +Assigned To: felipe
 [2010-10-03 18:02 UTC] felipe@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

A recent version of PCRE added the PCRE_UCP flag. As its documentation states:
"In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII characters, even in UTF-8 mode. However, this can be changed by setting the PCRE_UCP option."

With the flag set, we get:
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}
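
The result above can be checked with a short script. This is a sketch that assumes a PHP build containing the fix, in which the "u" modifier also enables PCRE_UCP; the helper name matchOffsets() is hypothetical. It runs both patterns from the original test script and collects only the byte offsets of the matches.

```php
<?php
// Hypothetical helper: return the byte offsets of all matches of $pattern.
function matchOffsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);

    // $m[0] holds [matched text, byte offset] pairs; keep the offsets.
    return array_column($m[0], 1);
}

$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

// Assumption: with PCRE_UCP in effect, "ß" counts as a word character,
// so there is no word boundary between "ß" and "wasser" in "Süßwasserpool".
$boundary = matchOffsets('/\bwasser/iu', $message);
$nonWord  = matchOffsets('/[^\w]wasser/iu', $message);

var_dump($boundary, $nonWord);
```

On a build without the fix, the same script would be expected to report the additional offsets 17 and 15 listed under "Actual result" in the original report.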
 [2010-10-04 22:11 UTC] marc dot bennewitz at giata dot de
now it works fine :)
thanks
 