Bug #30382 preg_match and utf-8
Submitted: 2004-10-10 15:49 UTC Modified: 2005-02-12 22:00 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: TiloLutz at gmx dot de Assigned: derick (profile)
Status: Not a bug Package: PCRE related
PHP Version: 4.3.9 OS: Suse Linux 9.1
Private report: No CVE-ID: None
 [2004-10-10 15:49 UTC] TiloLutz at gmx dot de
Description:
------------
preg_match doesn't work correctly when UTF-8 is used.

preg_match('/^([[:alpha:]])*$/u', "ä")
should return true because [[:alpha:]] also contains
localized special characters like ä, ö, ü.
Unfortunately it returns false.

It works with iso-8859-15 but doesn't work with utf-8


Reproduce code:
---------------
putenv("LANG=de_DE");
setlocale(LC_ALL, "de_DE");
if (preg_match('/^([[:alpha:]])*$/u', "ä")) echo "true";

putenv("LANG=de_DE.utf8");
setlocale(LC_ALL, "de_DE.utf8");
if (preg_match('/^([[:alpha:]])*$/u', "ä")) echo "true";


Expected result:
----------------
true
true



 [2004-10-10 16:00 UTC] aidan@php.net
Reassigning to proper category.
 [2004-10-10 16:20 UTC] tony2001@php.net
"PCRE related" is the right category for this report.

 [2004-10-11 08:01 UTC] derick@php.net
This depends on how the ä is encoded in your script. If it's just iso-8859-1 then it won't work. No bug here unless you can come up with an example that works. (Post a link to a zip file containing your scripts).
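
Editor's sketch: one way to check the encoding point raised here is to look at the raw bytes of the subject string and ask PCRE whether they form valid UTF-8. The helper name is_valid_utf8 is hypothetical and not part of PHP; the sketch assumes PHP 7 or later and relies on preg_match() returning false when the /u modifier is given a subject that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper (not a PHP built-in): true when $s is well-formed UTF-8.
// With the /u modifier, preg_match() returns false (not 0) for invalid UTF-8,
// so an empty pattern is enough to validate the subject.
function is_valid_utf8(string $s): bool
{
    return @preg_match('//u', $s) === 1;
}

$utf8   = "\xC3\xA4"; // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" encoded as ISO-8859-1 (one byte)

echo bin2hex($utf8), ' ', var_export(is_valid_utf8($utf8), true), "\n";     // c3a4 true
echo bin2hex($latin1), ' ', var_export(is_valid_utf8($latin1), true), "\n"; // e4 false
```

Byte escapes are used instead of a literal ä so the result does not depend on how the script file itself is saved.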
 [2004-10-11 11:41 UTC] TiloLutz at gmx dot de
You can find an example at 
http://www.stud.uni-karlsruhe.de/~usjp/preg_match.zip 
 
The file is 100% encoded as utf8.
 [2004-10-11 12:40 UTC] chregu@php.net
I can reproduce it on Debian, but not on Mac OS X
 [2004-10-11 13:00 UTC] derick@php.net
I can reproduce it too, but I need to think real hard about it first before I can say whether it is correct or not :)
 [2004-12-06 22:08 UTC] pmichaud at pobox dot com
It might help to know that PCRE doesn't support the [:alpha:], [:digit:], etc. classes in UTF-8 mode.  From http://www.pcre.org/pcre.txt, under "POSIX CHARACTER CLASSES":

   In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes.

So, the fact that [:alpha:] doesn't work on UTF-8 strings appears to be a limitation of PCRE itself.  (And I do so strongly wish it were otherwise.)

Pm
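
Editor's sketch: given the limitation quoted above, a commonly used alternative to [[:alpha:]] in UTF-8 mode is the Unicode property escape \p{L} ("any letter"). This is only a sketch under assumptions: it requires a PCRE library built with Unicode property support, which the PHP 4.3.9 build from this report may not have had, and the helper name is_all_letters is hypothetical. The pattern keeps the shape of the one in the report, so an empty string also matches.

```php
<?php
// Hypothetical helper (not a PHP built-in): true when every character of $s
// is a Unicode letter. Assumes PCRE was built with Unicode property support.
function is_all_letters(string $s): bool
{
    // \p{L} matches any Unicode letter; /u makes PCRE treat pattern and
    // subject as UTF-8 instead of single bytes.
    return preg_match('/^\p{L}*$/u', $s) === 1;
}

var_dump(is_all_letters("\xC3\xA4"));      // "ä" as UTF-8: bool(true)
var_dump(is_all_letters("Stra\xC3\x9Fe")); // "Straße" as UTF-8: bool(true)
var_dump(is_all_letters("abc1"));          // contains a digit: bool(false)
```

Comparing the result with === 1 also treats a false return (for example, on invalid UTF-8 input) as "no match" instead of silently passing it through.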
 [2005-02-12 22:00 UTC] tony2001@php.net
PCRE has limited UTF-8 support.
"A class is matched against a UTF-8 character instead of just a single byte, but it can match only characters whose values are less than 256. Characters with greater values always fail to match a class." (c) man pcre

No bug in PHP -> bogus.
 