Bug #19346 preg_match() does not work for UTF-8
Submitted: 2002-09-10 17:30 UTC Modified: 2002-09-14 19:13 UTC
From: gamid at isayev dot net Assigned:
Status: Closed Package: PCRE related
PHP Version: 4.3.2-dev OS: Mandrake 8.1
Private report: No CVE-ID: None
 [2002-09-10 17:30 UTC] gamid at isayev dot net
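<?php
// Build "TestÜ": utf8_encode() converts the Latin-1 byte 0xDC to UTF-8.
$name = "Test" . utf8_encode("\xDC");
echo "name = '$name'<BR>\n";
// Expected to print 1 if [[:alpha:]] matches the non-ASCII letter under 'u'.
echo preg_match("/^[[:alpha:]]+$/u", $name);
echo "<BR>\n";
?>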
The above snippet does not produce the expected result.
According to the documentation, preg_match() should return a true value, but it returns false.
It looks as though the 'u' modifier does not work as described in the manual ( http://www.php.net/manual/en/pcre.pattern.modifiers.php ).

 [2002-09-10 17:31 UTC] gamid at isayev dot net
BTW, it also does not work in PHP 4.3.2-dev:

PHP: 20020307
PHP Extension: 20020429
Zend Extension: 20020903
 [2002-09-10 18:47 UTC] wez@php.net
The manual says:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. 

So, it's only the pattern string that is treated as
UTF-8; the subject string is still treated as a sequence
of ASCII bytes.

I have to admit that I thought that the subject would
be treated as utf-8 too, but that is not the case.
So, this is "bogus" (it sounds so harsh) as far as PHP
is concerned, although this is a valid feature request;
it should be taken up with the pcre library people.
 
 [2002-09-11 08:51 UTC] gamid at isayev dot net
But it does not make sense to specify UTF-8 in the pattern string if the subject string may contain only ASCII, does it?

BTW, how can I specify UTF-8 characters in the pattern string?
Let's say I need to write a pattern string that matches _any_ UTF-8 alphabetic character (like 'A-Za-z' for ASCII). What would this pattern string look like if [:alpha:] and \w do not work?
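
[Editorial sketch, not part of the original thread] One way to express "any alphabetic character" is the Unicode property class \p{L} combined with the 'u' modifier. This is only a minimal sketch. It assumes a modern PHP (7.0 or later) whose PCRE library was built with UTF-8 and Unicode property support, which was not the case for the bundled library discussed in this report. The helper name is_all_letters() is hypothetical.

```php
<?php
// Hypothetical helper: true only if $s consists entirely of Unicode letters.
function is_all_letters(string $s): bool
{
    // \p{L} matches any Unicode letter. The 'u' modifier asks PCRE to treat
    // both the pattern and the subject as UTF-8. preg_match() returns false
    // (not 0) when the subject is not valid UTF-8, so compare against 1.
    return preg_match('/^\p{L}+$/u', $s) === 1;
}

$name = "Test\xC3\x9C";                     // "TestÜ" written as raw UTF-8 bytes
var_dump(is_all_letters($name));            // letters only
var_dump(is_all_letters("Test1"));          // contains a digit
var_dump(is_all_letters("Test\xDC"));       // Latin-1 byte, invalid as UTF-8
```

Comparing against 1 rather than testing truthiness keeps "no match" (0) and "error" (false) from being confused with a match.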
 [2002-09-11 09:48 UTC] wez@php.net
I meant 8-bit instead of ASCII.
This is not a problem with PHP but a problem with the pcre
library.  If the pcre library doesn't support it, neither
does PHP; so if you don't like it, complain to the pcre
people and when they have implemented this feature we can
bundle it with PHP.
So, this is bogus again.
 [2002-09-11 11:33 UTC] gamid at isayev dot net
The latest (3.9) PCRE library works fine with a UTF-8 subject string if the library is compiled with --enable-utf8.
I have verified this with a test C program.
I have also tested compiling PHP against a PCRE library that I installed separately on my system (--with-pcre-regex=/usr/local/), and that worked with UTF-8 as well.
It appears that the PCRE library that comes with PHP may have been compiled without --enable-utf8, and that this broke UTF-8 support in preg_match().
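
[Editorial sketch, not part of the original thread] A quick way to check from PHP itself whether the linked PCRE library handles UTF-8 is to run a pattern with the 'u' modifier against a known multi-byte character. This is a rough probe rather than a definitive test. It assumes PHP 7.0 or later (for the type declaration) and relies on the predefined PCRE_VERSION constant, which is only available from PHP 5.2.4. The helper name pcre_has_utf8() is hypothetical.

```php
<?php
// Report which PCRE library version this PHP build is linked against.
echo "PCRE version: ", PCRE_VERSION, "\n";

// Hypothetical helper: probes whether the 'u' modifier works in this build.
function pcre_has_utf8(): bool
{
    // "\xC3\x9C" is the two-byte UTF-8 encoding of U+00DC. With UTF-8
    // support, '.' consumes the whole character and the pattern matches.
    // Without it, the pattern fails to compile or sees two separate bytes.
    // '@' suppresses the warning emitted when compilation fails.
    return @preg_match('/^.$/u', "\xC3\x9C") === 1;
}

echo "UTF-8 support: ", pcre_has_utf8() ? "yes" : "no", "\n";
```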
 [2002-09-11 12:50 UTC] wez@php.net
It's more likely that the newer version of pcre has actually fixed this bug.
I'll investigate updating the bundled version (it's best
to discuss this with the other php-dev people first).
Thanks for doing the legwork on this one,

--Wez.
 [2002-09-14 19:13 UTC] wez@php.net
We are now bundling PCRE 3.9 with PHP; I'm marking this
as closed since you reported that linking to that version
solved the problem.
Thanks!
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Fri Mar 14 18:01:30 2025 UTC