Bug #19346 preg_match() does not work for UTF-8
Submitted: 2002-09-10 17:30 UTC Modified: 2002-09-14 19:13 UTC
From: gamid at isayev dot net Assigned:
Status: Closed Package: PCRE related
PHP Version: 4.3.2-dev OS: Mandrake 8.1
Private report: No CVE-ID: None
 [2002-09-10 17:30 UTC] gamid at isayev dot net
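<?php
// utf8_encode() converts the ISO-8859-1 byte 0xDC (Ü) to its UTF-8 encoding.
$name = "Test" . utf8_encode("\xDC");
echo "name = '$name'<BR>\n";
// Expected to print 1 (match) in UTF-8 mode, since Ü is a letter.
echo preg_match("/^[[:alpha:]]+$/u", $name);
echo "<BR>\n";
?>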
The above snippet does not produce the expected result.
According to the documentation, preg_match() should return a true value, but it returns false.
It looks as though the 'u' modifier does not work as described in the manual ( http://www.php.net/manual/en/pcre.pattern.modifiers.php ).

 [2002-09-10 17:31 UTC] gamid at isayev dot net
BTW, it also does not work in PHP 4.3.2-dev:

PHP: 20020307
PHP Extension: 20020429
Zend Extension: 20020903
 [2002-09-10 18:47 UTC] wez@php.net
The manual says:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. 

So, it's only the pattern string that is treated as
utf-8; the subject string is still treated as a sequence
of ascii bytes.

I have to admit that I thought that the subject would
be treated as utf-8 too, but that is not the case.
So, this is "bogus" (it sounds so harsh) as far as PHP
is concerned, although this is a valid feature request;
it should be taken up with the pcre library people.
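
The difference between a byte-oriented and a UTF-8-aware subject can be illustrated with a minimal sketch. It assumes a current PHP build whose PCRE has working UTF-8 support (not the bundled version discussed in this report) and PHP 5.4 or later, where the third argument of preg_match_all() is optional. The variable names are illustrative only.

```php
<?php
// "Test" followed by U+00DC (Ü), written as its two UTF-8 bytes.
$subject = "Test\xC3\x9C";

// Without /u, "." consumes one byte at a time, so Ü is counted twice.
$byteMatches = preg_match_all('/./', $subject);

// With /u, pattern and subject are both decoded as UTF-8,
// so "." consumes one whole character at a time.
$charMatches = preg_match_all('/./u', $subject);

echo "bytes: $byteMatches, characters: $charMatches\n";
```

If the two counts are equal for a subject containing non-ASCII characters, the subject is being processed as plain 8-bit bytes, which is the behaviour described in the comment above.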
 
 [2002-09-11 08:51 UTC] gamid at isayev dot net
But it does not make sense to allow UTF-8 in the pattern string if the subject string may contain only ASCII, does it?

BTW, how can I specify UTF-8 characters in the pattern string?
Let's say I need to write a pattern string that matches _any_ UTF-8 alphabetic character (like 'A-Za-z' for ASCII). What would this pattern string look like if [:alpha:] and \w do not work?
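
One possible answer to this question, offered as a sketch and not as something the PCRE version in this report provides: later PCRE releases support Unicode property escapes, and \p{L} matches any letter when the u modifier is active. The helper name is_all_letters() is hypothetical, and the example assumes PHP 7 or later with UTF-8 and Unicode property support in PCRE.

```php
<?php
// Hypothetical helper: true when $text consists solely of Unicode letters.
// Assumes $text is valid UTF-8; preg_match() returns false on invalid input,
// which this helper also treats as "not all letters".
function is_all_letters(string $text): bool
{
    return preg_match('/^\p{L}+$/u', $text) === 1;
}

var_dump(is_all_letters("Test\xC3\x9C")); // "TestÜ"
var_dump(is_all_letters("Test1"));        // contains a digit
```

Comparing against 1 with === keeps the three possible results of preg_match() apart: 1 for a match, 0 for no match, and false for an error such as a malformed UTF-8 subject.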
 [2002-09-11 09:48 UTC] wez@php.net
I meant 8bit instead of ascii.
This is not a problem with PHP but a problem with the pcre
library.  If the pcre library doesn't support it, neither
does PHP; so if you don't like it, complain to the pcre
people and when they have implemented this feature we can
bundle it with PHP.
So, this is bogus again.
 [2002-09-11 11:33 UTC] gamid at isayev dot net
The latest (3.9) PCRE library works fine with a UTF-8 subject string if the library is compiled with --enable-utf8.
I have verified this with a test C program.
I have also tested compiling PHP against a PCRE library that I installed separately on my system (--with-pcre-regex=/usr/local/), and that worked with UTF-8 as well.
It appears that the PCRE library bundled with PHP may have been compiled without --enable-utf8, and this broke UTF-8 support in preg_match().
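
A quick way to see which situation a given PHP build is in can be sketched as follows, assuming that a pattern rejected at compile time makes preg_match() return false. The function name pcre_utf8_status() and its return strings are invented for this illustration; PCRE_VERSION is a constant exposed by later PHP versions, not by the 4.x builds discussed here.

```php
<?php
// Hypothetical probe: reports how the linked PCRE handles the u modifier.
function pcre_utf8_status(): string
{
    // U+00DC (Ü) as two UTF-8 bytes. In UTF-8 mode "^.$" matches it as one
    // character; in byte mode "." covers one byte only, so there is no match.
    $result = @preg_match('/^.$/u', "\xC3\x9C");

    if ($result === false) {
        return 'unsupported';   // pattern rejected, e.g. PCRE built without UTF-8
    }
    return $result === 1 ? 'utf8-ok' : 'bytes-only';
}

echo 'PCRE ', PCRE_VERSION, ': ', pcre_utf8_status(), "\n";
```

Running this once against the bundled library and once against a separately installed one would show whether the two builds differ in UTF-8 handling.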
 [2002-09-11 12:50 UTC] wez@php.net
It's more likely that the newer version of pcre has actually fixed this bug.
I'll investigate updating the bundled version (it's best
to discuss this with the other php-dev people first).
Thanks for doing the legwork on this one,

--Wez.
 [2002-09-14 19:13 UTC] wez@php.net
We are now bundling PCRE 3.9 with PHP; I'm marking this
as closed since you reported that linking to that version
solved the problem.
Thanks!
 
PHP Copyright © 2001-2019 The PHP Group
All rights reserved.
Last updated: Wed Nov 13 16:01:27 2019 UTC