Bug #52333 Metacharacter \d in a regexp causes an error on some Russian letters
Submitted: 2010-07-14 07:03 UTC Modified: 2010-07-16 07:01 UTC
From: a dot dobkin at drweb dot com Assigned:
Status: Not a bug Package: *Regular Expressions
PHP Version: 5.3.2 OS: Windows
Private report: No CVE-ID: None

 [2010-07-14 07:03 UTC] a dot dobkin at drweb dot com
Description:
------------
On Windows, the \d metacharacter in a regular expression incorrectly matches 
some Russian letters. 

Example script:

$user_name_ru = "Василий";
$regexp = "/[\d\!\@\#\%\$\^&*\(\)\~\=\/\|\"\'\?\:\;\/]+/";
if( preg_match( $regexp,$user_name_ru ) ) {
 echo 'ERR';
} else {
 echo 'OK';
}

preg_match() reports a match if the word contains one or more of the characters 
'й', 'г', 'в'. If the '\d' metacharacter is removed from the pattern, 
preg_match() reports no match. With PHP 5.2.13 everything works correctly.


 [2010-07-14 07:32 UTC] a dot dobkin at drweb dot com
OS: Windows Server 2003 R2 SP2, English, x86
 [2010-07-16 07:01 UTC] aharvey@php.net
-Status: Open
+Status: Bogus
 [2010-07-16 07:01 UTC] aharvey@php.net
This is an encoding issue, rather than a bug in PHP itself: by
default, preg_match() works like most things in PHP and just treats
strings as a series of bytes. If Василий is encoded in UTF-16,
several of its bytes fall in the range of the ASCII digits
(0x30 to 0x39), so \d matches them.
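
To make the byte-level explanation concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8, that the mbstring extension is loaded, and that the input reached preg_match() as UTF-16BE. The report does not state the actual encoding, and which bytes collide with ASCII digits depends on it.

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$name = "Василий";

// Re-encode to UTF-16BE to mimic the situation described in the comment.
$utf16 = mb_convert_encoding($name, 'UTF-16BE', 'UTF-8');

// Collect every byte that happens to be an ASCII digit (0x30 to 0x39).
$digitBytes = [];
foreach (str_split($utf16) as $byte) {
    $code = ord($byte);
    if ($code >= 0x30 && $code <= 0x39) {
        $digitBytes[] = $byte;
    }
}

// Without the /u modifier the pattern is applied byte by byte,
// so \d matches those stray digit bytes.
$bytewiseMatch = preg_match('/\d/', $utf16);

// With /u and a UTF-8 subject, the Cyrillic letters are seen as whole
// characters and \d does not match any of them.
$unicodeMatch = preg_match('/\d/u', $name);

echo bin2hex($utf16), "\n";
echo implode(',', $digitBytes), "\n";
echo $bytewiseMatch, ' ', $unicodeMatch, "\n";
```

In this UTF-16BE case the second byte of several letters lands in the digit range, which is enough for a byte-oriented \d to report a match.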

preg_match() does have support for Unicode text when it's encoded
as UTF-8 via the /u modifier, so the right way to handle this would
be using iconv() or mb_convert_encoding() to convert the string to
UTF-8, then using a regex like
"/[\d\!\@\#\%\$\^&*\(\)\~\=\/\|\"\'\?\:\;\/]+/u" to force UTF-8
mode.
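
The suggestion above could be sketched roughly as follows. The function name has_forbidden_chars() and its $sourceEncoding parameter are hypothetical, introduced only for illustration; the caller is assumed to know the encoding of the incoming string, and mbstring is assumed to be loaded. The character class is the one from the report with the redundant backslashes dropped.

```php
<?php
/**
 * Hypothetical helper: returns true if $input contains a digit or one of
 * the punctuation characters from the original pattern.
 * $sourceEncoding is an assumption supplied by the caller, e.g. 'UTF-16BE'.
 */
function has_forbidden_chars(string $input, string $sourceEncoding): bool
{
    // Normalise to UTF-8 first, as the comment recommends.
    $utf8 = mb_convert_encoding($input, 'UTF-8', $sourceEncoding);

    // The trailing "u" makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/[\d!@#%$^&*()~=\/|"\'?:;]+/u';

    $result = preg_match($pattern, $utf8);
    if ($result === false) {
        // For example, the subject was not valid UTF-8 after conversion.
        throw new RuntimeException('preg_match() failed, code ' . preg_last_error());
    }
    return $result === 1;
}

// Usage, assuming this file is saved as UTF-8:
echo has_forbidden_chars("Василий", 'UTF-8') ? 'ERR' : 'OK';
```

Checking for a false return matters here because preg_match() in /u mode fails, rather than returning 0, when the subject is not valid UTF-8.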
 