Bug #39744 UTF8 support
Submitted: 2006-12-05 15:48 UTC Modified: 2006-12-05 17:33 UTC
From: sdamir at gmail dot com Assigned:
Status: Not a bug Package: *Regular Expressions
PHP Version: 5.2.0 OS: Linux 2.6.18
Private report: No CVE-ID: None

 

 [2006-12-05 15:48 UTC] sdamir at gmail dot com
Description:
------------
I am trying to match all alphabetic UTF-8 characters. I know (tested) that in Perl, if $string is UTF-8 encoded and I use a regex like =~ /\w/, it matches all alphabetic UTF-8 characters (Cyrillic, Chinese, English, etc.). However, this is not the case in PHP. I read that I need to use special patterns like \pL, but this doesn't work for me either: it matches some characters, but Cyrillic letters aren't matched. I don't know whether this is a bug or I am doing something wrong, but I have searched everywhere and asked on many IRC support channels, and no one has an answer.

Reproduce code:
---------------
<?php 

// setlocale(LC_ALL, 'en_US.utf8'); // If I set the locale to en_US, it matches some characters like ??? but not Cyrillic; en_US.utf8 won't match anything.

$str=" &#1057;&#1088;&#1077;&#1115;&#1072; ";

utf8_encode($str);

var_dump($str); 
preg_match("/[\w\pL]/u",$str, $r); 
var_dump($r);

?> 

Expected result:
----------------
string(3) " s "
array(1) {
  [0]=>
  string(1) "&#1057;"
}


Actual result:
--------------
string(12) " &#1057;&#1088;&#1077;&#1115;&#1072; "
array(0) {
}


 [2006-12-05 15:51 UTC] sdamir at gmail dot com
I don't know why, but your bug system converted the letters in my PHP code into &#crap; stuff.
 [2006-12-05 16:09 UTC] tony2001@php.net
http://php.net/utf8_encode

utf8_encode -- Encodes an **ISO-8859-1** string to UTF-8
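
To make the point above concrete, here is a minimal sketch (not part of the original report) of what utf8_encode does to a string that is already UTF-8. The variable names are illustrative. Note that utf8_encode returns the converted string rather than modifying its argument, and that it is deprecated as of PHP 8.2, so newer versions emit a deprecation notice.

```php
<?php
// "С" (U+0421, Cyrillic capital Es) written out as its two UTF-8 bytes.
$utf8 = "\xD0\xA1";

// The return value is discarded here, so $utf8 is left unchanged.
utf8_encode($utf8);
$unchanged = ($utf8 === "\xD0\xA1");

// When the result is kept, each input byte is treated as one ISO-8859-1
// character and re-encoded, so the already-UTF-8 text is double-encoded:
// 0xD0 -> U+00D0 -> C3 90, and 0xA1 -> U+00A1 -> C2 A1.
$double = utf8_encode($utf8);

var_dump($unchanged);        // whether the bare call altered $utf8
var_dump(bin2hex($utf8));    // original bytes
var_dump(bin2hex($double));  // bytes after the unnecessary conversion
```

In other words, the function is only appropriate when the input really is ISO-8859-1; applying it to UTF-8 input produces mojibake rather than a no-op.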
 [2006-12-05 17:32 UTC] sdamir at gmail dot com
That's not what I want. I don't see how that is relevant, because the string is already UTF-8 encoded. My question is simple: how do you match all UTF-8 alphabetic characters?
 [2006-12-05 17:33 UTC] sdamir at gmail dot com
Damn, I just realised I used "utf8_encode($str);" in the source code I submitted. Please ignore that line!
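
For reference, a minimal sketch of the matching the report asks about, without the utf8_encode line. It assumes the entity-encoded sample in the report stands for real UTF-8 Cyrillic text (it is rebuilt here with html_entity_decode), and that PHP's PCRE library was built with UTF-8 and Unicode property support; without that support the pattern fails to compile and preg_match returns false. The variable names are illustrative.

```php
<?php
// Assumption: the &#NNNN; entities in the report were real Cyrillic
// letters before the bug system converted them. Decode them to UTF-8.
$str = html_entity_decode(' &#1057;&#1088;&#1077;&#1115;&#1072; ', ENT_QUOTES, 'UTF-8');

// Two ASCII spaces plus five two-byte Cyrillic letters.
$bytes = strlen($str);

// \p{L} matches any Unicode letter; the /u modifier makes PCRE treat
// both the pattern and the subject as UTF-8.
$found = preg_match('/\p{L}/u', $str, $first);

// Count every letter in the string, not just the first one.
$count = preg_match_all('/\p{L}/u', $str, $all);

var_dump($bytes, $found, $first, $count);
```

The byte length comes out as 12, which agrees with the string(12) shown under "Actual result" in the report. On a PCRE build with Unicode property support, the first match should be the leading Cyrillic letter and the total count should be five.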
 
PHP Copyright © 2001-2021 The PHP Group
All rights reserved.
Last updated: Fri Dec 03 17:03:34 2021 UTC