Bug #39744 UTF8 support
Submitted: 2006-12-05 15:48 UTC Modified: 2006-12-05 17:33 UTC
From: sdamir at gmail dot com Assigned:
Status: Not a bug Package: *Regular Expressions
PHP Version: 5.2.0 OS: Linux 2.6.18
Private report: No CVE-ID: None
 [2006-12-05 15:48 UTC] sdamir at gmail dot com
Description:
------------
I am trying to match all alphabetic UTF-8 characters. I know (I tested it) that in Perl, if $string is UTF-8 encoded and I use a regex like =~ /\w/, it matches all alphabetic characters (Cyrillic, Chinese, English, etc.). This is not the case in PHP. I read that I need to use special patterns like \pL, but that doesn't work for me either: it matches some characters, but Cyrillic letters aren't matched. I don't know whether this is a bug or I am doing something wrong, but I have searched everywhere and asked in many IRC support channels, and no one has an answer.

Reproduce code:
---------------
<?php 

// setlocale(LC_ALL, 'en_US.utf8'); // if I set the locale to en_US, it matches some characters like ??? but not Cyrillic; en_US.utf8 won't match anything.

$str=" Срећа ";

utf8_encode($str);

var_dump($str); 
preg_match("/[\w\pL]/u",$str, $r); 
var_dump($r);

?> 

Expected result:
----------------
string(12) " Срећа "
array(1) {
  [0]=>
  string(2) "С"
}


Actual result:
--------------
string(12) " Срећа "
array(0) {
}
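
One way to narrow down a result like this is to separate an encoding problem from a pattern problem. The sketch below is an illustration, not the reporter's code: `first_letter()` is a hypothetical helper, the input is written with `\u{...}` escapes (PHP 7 or later) so the outcome does not depend on how the source file is saved, and it assumes a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helper (not from the report): returns the first letter found
// in $subject, or a short diagnostic explaining why matching failed.
function first_letter(string $subject): string
{
    $result = preg_match('/\p{L}/u', $subject, $m);
    if ($result === false) {
        // With the /u modifier, a subject that is not valid UTF-8 makes
        // preg_match() fail instead of reporting "no match".
        return preg_last_error() === PREG_BAD_UTF8_ERROR
            ? 'error: subject is not valid UTF-8'
            : 'error: PCRE failure code ' . preg_last_error();
    }
    return $result === 1 ? $m[0] : 'no letter found';
}

$valid   = " \u{0421}\u{0440}\u{0435}\u{045B}\u{0430} "; // " Срећа " as UTF-8
$invalid = " \xD1\xF0\xE5 ";                             // single bytes, not valid UTF-8

echo strlen($valid), "\n";          // 12 (bytes, matching string(12) above)
echo first_letter($valid), "\n";    // С
echo first_letter($invalid), "\n";  // error: subject is not valid UTF-8
```

If the diagnostic branch fires, the input bytes are the problem rather than the `\pL` pattern.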


History

 [2006-12-05 15:51 UTC] sdamir at gmail dot com
I don't know why, but your bug system converted the letters in my PHP code into &#...; entities.
 [2006-12-05 16:09 UTC] tony2001@php.net
http://php.net/utf8_encode

utf8_encode -- Encodes an **ISO-8859-1** string to UTF-8
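
To illustrate the point about ISO-8859-1, here is a minimal sketch of what happens when that conversion is applied to text that is already UTF-8. It uses `mb_convert_encoding()` with an ISO-8859-1 source, which performs the same conversion as `utf8_encode()`; that substitution is an assumption made here because `utf8_encode()` is deprecated as of PHP 8.2, and it requires the mbstring extension.

```php
<?php
// "С" (U+0421) is the two bytes D0 A1 in UTF-8.
$utf8 = "\u{0421}";

// Same conversion utf8_encode() performs: every input byte is read as one
// ISO-8859-1 character. Assumes ext/mbstring is available.
$reencoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";       // d0a1
echo bin2hex($reencoded), "\n";  // c390c2a1 (the mojibake "Ð¡")

// The function returns a new string; the argument itself is untouched, so a
// bare call with the result discarded changes nothing.
var_dump($utf8 === "\xD0\xA1");  // bool(true)
```

Note that a bare `utf8_encode($str);` statement discards the return value, so it leaves `$str` unchanged.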
 [2006-12-05 17:32 UTC] sdamir at gmail dot com
That's... not what I want. I don't see how that is relevant, because the string is already UTF-8 encoded. My question is simple: how do you match all UTF-8 alphabetic characters?
 [2006-12-05 17:33 UTC] sdamir at gmail dot com
Damn, I just realised I left "utf8_encode($str);" in the source code I submitted. Please ignore that line!
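
For the question as asked, a minimal sketch of collecting every run of letters, assuming the subject is valid UTF-8 and PCRE has Unicode property support. The sample string and variable names are illustrative only; `\p{L}` matches letters of any script, and `\p{Cyrillic}` restricts the match to one script.

```php
<?php
// Illustrative input: Cyrillic word, ASCII word, digits.
$str = " \u{0421}\u{0440}\u{0435}\u{045B}\u{0430} abc 123 ";

// Every run of letters, regardless of script. The /u modifier makes PCRE
// treat both pattern and subject as UTF-8.
preg_match_all('/\p{L}+/u', $str, $letters);
echo implode(',', $letters[0]), "\n";   // Срећа,abc

// Only runs of letters from the Cyrillic script.
preg_match_all('/\p{Cyrillic}+/u', $str, $cyrillic);
echo implode(',', $cyrillic[0]), "\n";  // Срећа
```

Using `\p{L}` states the intent explicitly instead of relying on how `\w` behaves under a given locale or build.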
 
PHP Copyright © 2001-2025 The PHP Group
All rights reserved.
Last updated: Mon Mar 10 21:01:30 2025 UTC