Bug #52971 PCRE-Meta-Characters not working with utf-8
Submitted: 2010-10-02 17:58 UTC Modified: 2010-10-04 22:11 UTC
From: marc dot bennewitz at giata dot de Assigned: felipe (profile)
Status: Closed Package: PCRE related
PHP Version: 5.3.3 OS: Linux
Private report: No CVE-ID: None
 [2010-10-02 17:58 UTC] marc dot bennewitz at giata dot de
Description:
------------
PCRE metacharacters such as \b and \w do not work with Unicode strings.

PHP-5.3.3 (32Bit)
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000

iconv

iconv support => enabled
iconv implementation => glibc
iconv library version => 2.10.1

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1


Test script:
---------------
<?php // encoding: UTF-8

$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

$pattern = '/\bwasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

$pattern = '/[^\w]wasser/iu';
preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);

Expected result:
----------------
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

Actual result:
--------------
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "wasser"
      [1]=>
      int(17)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(8) "ßwasser"
      [1]=>
      int(15)
    }
    [1]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}

 [2010-10-02 20:26 UTC] cataphract@php.net
-Status: Open +Status: Bogus
 [2010-10-02 20:26 UTC] cataphract@php.net
This is by design; it is the way \b and \w are defined in PCRE.

You'll have to use another strategy, such as lookbehind combined with Unicode character properties.
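
The suggested workaround might look like the following sketch. It is not taken from the report: the negative lookbehind `(?<![\p{L}\p{N}_])` is one assumed way to stand in for `\b` before "wasser", treating Unicode letters, digits and the underscore as word characters. It assumes the file is saved as UTF-8 and that PCRE was built with Unicode property support.

```php
<?php // encoding: UTF-8
// Sketch only: emulate a Unicode-aware word boundary before "wasser"
// without relying on \b or \w.
$message = 'Der ist ein Süßwasserpool Süsswasserpool ... verschiedene Wassersportmöglichkeiten bei ...';

// (?<![\p{L}\p{N}_]) : the preceding character must not be a Unicode
// letter, a Unicode digit, or an underscore.
$pattern = '/(?<![\p{L}\p{N}_])wasser/iu';

preg_match_all($pattern, $message, $match, PREG_OFFSET_CAPTURE);
var_dump($match);
```

With the message from the test script, this should report a single match, "Wasser" at byte offset 61, because the "wasser" in "Süßwasserpool" is preceded by the letter "ß".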
 [2010-10-03 10:21 UTC] marc dot bennewitz at giata dot de
There are some problems with that:
1. On Windows it works as expected.
2. With Unicode properties there is no word boundary (\w, \W).
3. With the "u" modifier, PHP knows that the subject is UTF-8.
4. http://php.net/manual/regexp.reference.escape.php has no note about UTF-8 incompatibility.

php.exe -i
...
iconv

iconv support => enabled
iconv implementation => "libiconv"
iconv library version => 1.11

Directive => Local Value => Master Value
iconv.input_encoding => ISO-8859-1 => ISO-8859-1
iconv.internal_encoding => ISO-8859-1 => ISO-8859-1
iconv.output_encoding => ISO-8859-1 => ISO-8859-1
...
pcre

PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 8.02 2010-03-19

Directive => Local Value => Master Value
pcre.backtrack_limit => 100000 => 100000
pcre.recursion_limit => 100000 => 100000
...
 [2010-10-03 11:02 UTC] cataphract@php.net
-Status: Bogus +Status: Re-Opened
 [2010-10-03 11:02 UTC] cataphract@php.net
I'm reopening, as there is indeed different behavior on Windows that I can't yet quite explain.
 [2010-10-03 18:01 UTC] felipe@php.net
Automatic comment from SVN on behalf of felipe
Revision: http://svn.php.net/viewvc/?view=revision&revision=303963
Log: - Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8)
#   In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
#   characters, even in UTF-8 mode. However, this can be changed by setting
#   the PCRE_UCP option.
 [2010-10-03 18:02 UTC] felipe@php.net
-Status: Re-Opened +Status: Closed -Assigned To: +Assigned To: felipe
 [2010-10-03 18:02 UTC] felipe@php.net
This bug has been fixed in SVN.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

The latest version of PCRE added a PCRE_UCP flag. As the documentation states:
"In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII characters, even in UTF-8 mode. However, this can be changed by setting the PCRE_UCP option."

With the flag set, we get:
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(6) "Wasser"
      [1]=>
      int(61)
    }
  }
}
array(1) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(7) " Wasser"
      [1]=>
      int(60)
    }
  }
}
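
To see whether a given PHP build behaves this way, a small probe along the following lines may help. It is only a sketch: the variable names are illustrative, and it assumes the file is saved as UTF-8. It checks whether \w matches a non-ASCII letter when the "u" modifier is used.

```php
<?php // encoding: UTF-8
// Sketch only: probe whether \w is Unicode-aware under the "u" modifier.
$letter = 'ß'; // a single character, two bytes in UTF-8

// preg_match() returns 1 on a match, 0 on no match, false on error.
$unicodeAware = preg_match('/^\w$/u', $letter) === 1;

echo $unicodeAware
    ? "\\w matches non-ASCII letters with the u modifier\n"
    : "\\w is ASCII-only with the u modifier\n";
```

On a build that includes this fix, the first message is expected; on an affected build such as the reported 5.3.3 on Linux, the second.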
 [2010-10-04 22:11 UTC] marc dot bennewitz at giata dot de
Now it works fine :)
Thanks.
 