Bug #52333 Metacharacter \d in a regexp causes an error on some Russian letters
Submitted: 2010-07-14 07:03 UTC
Modified: 2010-07-16 07:01 UTC
From: a dot dobkin at drweb dot com
Assigned:
Status: Not a bug
Package: *Regular Expressions
PHP Version: 5.3.2
OS: Windows
Private report: No
CVE-ID: None
 [2010-07-14 07:03 UTC] a dot dobkin at drweb dot com
Description:
------------
The \d metacharacter in a regular expression produces false matches on some
Russian letters on Windows.

Example script:

$user_name_ru = "Василий";
$regexp = "/[\d\!\@\#\%\$\^&*\(\)\~\=\/\|\"\'\?\:\;\/]+/";
if( preg_match( $regexp,$user_name_ru ) ) {
 echo 'ERR';
} else {
 echo 'OK';
}

preg_match() reports a match if the word contains one or more of the
characters 'й', 'г', 'в'. If the '\d' metacharacter is removed from the
pattern, preg_match() reports no match. With PHP version 5.2.13 everything
works correctly.

 [2010-07-14 07:32 UTC] a dot dobkin at drweb dot com
OS: Windows Server 2003 R2 SP2, English, x86
 [2010-07-16 07:01 UTC] aharvey@php.net
-Status: Open +Status: Bogus
 [2010-07-16 07:01 UTC] aharvey@php.net
This is an encoding issue, rather than a bug in PHP itself: by
default, preg_match() works like most things in PHP and just treats
strings as a series of bytes. If Василий is encoded in UTF-16,
there are multiple bytes in the range that are digits in ASCII, so
\d matches them.
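
To make the byte-level point concrete, here is a minimal sketch that dumps
the bytes of the string and picks out those in the ASCII digit range
0x30-0x39. It assumes the mbstring extension is loaded, that the script
file itself is saved as UTF-8, and that the byte order is UTF-16LE; none of
these is stated in the report. asciiDigitBytes() is a hypothetical helper
written for this illustration.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal below is UTF-8.
$name = "Василий";

// Assumption: UTF-16LE byte order; the report does not say which one was used.
$utf16 = mb_convert_encoding($name, 'UTF-16LE', 'UTF-8');

// Hypothetical helper: returns the bytes of $bytes that lie in '0'..'9'.
function asciiDigitBytes(string $bytes): string
{
    $found = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $code = ord($bytes[$i]);
        if ($code >= 0x30 && $code <= 0x39) {
            $found .= $bytes[$i];
        }
    }
    return $found;
}

echo bin2hex($utf16), "\n";          // hex dump of the UTF-16LE bytes
echo asciiDigitBytes($utf16), "\n";  // bytes that look like ASCII digits
echo asciiDigitBytes($name), "\n";   // same check on the UTF-8 bytes
```

Under these assumptions the second line printed lists the bytes that a
byte-oriented \d can match. The third line is empty, because every byte of
the UTF-8 form of this string is 0x80 or above.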

preg_match() does have support for Unicode text when it's encoded
as UTF-8 via the /u modifier, so the right way to handle this would
be using iconv() or mb_convert_encoding() to convert the string to
UTF-8, then using a regex like
"/[\d\!\@\#\%\$\^&*\(\)\~\=\/\|\"\'\?\:\;\/]+/u" to force UTF-8
mode.
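
A sketch of that approach, for illustration only: containsForbiddenChars()
and its $fromEncoding parameter are hypothetical names, the mbstring
extension is assumed to be available, and the caller is assumed to know the
encoding of the input. The character class is the same set as in the report
with the redundant backslashes dropped.

```php
<?php
// Hypothetical helper: true if $input contains a digit or one of the
// punctuation characters from the report's pattern.
function containsForbiddenChars(string $input, string $fromEncoding): bool
{
    // Convert to UTF-8 first, because the /u modifier expects UTF-8 input.
    $utf8 = mb_convert_encoding($input, 'UTF-8', $fromEncoding);

    // Same character set as the original pattern, in UTF-8 mode.
    $regexp = '/[\d!@#%$^&*()~=\/|"\'?:;]+/u';

    $result = preg_match($regexp, $utf8);
    if ($result === false) {
        // preg_match() returns false on failure, e.g. malformed UTF-8.
        throw new RuntimeException('preg_match failed, code ' . preg_last_error());
    }
    return $result === 1;
}

// Assumption: this file is saved as UTF-8.
$name  = "Василий";
$utf16 = mb_convert_encoding($name, 'UTF-16LE', 'UTF-8');

echo containsForbiddenChars($utf16, 'UTF-16LE') ? 'ERR' : 'OK', "\n";
echo containsForbiddenChars($name . '1', 'UTF-8') ? 'ERR' : 'OK', "\n";
```

Checking for a false return separately matters here, since a failed match
attempt would otherwise be indistinguishable from "no forbidden characters".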