Request #63671 Create a mb_str_word_count() function which is multi-byte aware
Submitted: 2012-12-03 03:09 UTC Modified: 2016-06-30 10:47 UTC
Votes:24
Avg. Score:4.5 ± 0.8
Reproduced:21 of 21 (100.0%)
Same Version:11 (52.4%)
Same OS:10 (47.6%)
From: kobrien at kiva dot org Assigned:
Status: Open Package: mbstring related
PHP Version: 5.5.0alpha1 OS: Ubuntu 12.04
Private report: No CVE-ID: None
 [2012-12-03 03:09 UTC] kobrien at kiva dot org
Description:
------------
Create an mb_str_word_count() function that correctly counts the number of 
words in a string containing multi-byte characters. This is currently not 
possible with str_word_count(), because it uses the isalpha() C function, which 
examines one byte at a time and does not handle multi-byte characters.

As suggested by aharvey, this new function would replace usage of isalpha() with 
iswalpha(). 

A naive (meaning no real knowledge of this or testing of it) patch would look 
like:

diff --git a/ext/standard/string.c b/ext/standard/string.c
index 7a4ae2e..9ab6b5f 100644
--- a/ext/standard/string.c
+++ b/ext/standard/string.c
@@ -5202,7 +5202,7 @@ PHP_FUNCTION(str_word_count)
 
        while (p < e) {
                s = p;
-               while (p < e && (isalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
+               while (p < e && (iswalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
                        p++;
                }
                if (p > s) {


Test script:
---------------
<?php
// existing str_word_count function for comparison
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n";
// returns 11
print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");
// returns 0

// new function mb_str_word_count
print mb_str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");
// returns 37

Expected result:
----------------
Using mb_str_word_count() returns the number of words in a string containing 
multibyte characters.

Actual result:
--------------
Currently there is no mb_str_word_count() function. Using str_word_count() on a 
string with multibyte characters returns 0.
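
A minimal sketch of why a byte-at-a-time check finds no letters in the test string. It assumes the script file is saved as UTF-8 and that the mbstring extension is loaded; the helper name all_bytes_non_ascii() is hypothetical and only serves the illustration.

```php
<?php
// Hypothetical helper: true if every byte of $s lies outside the ASCII range.
function all_bytes_non_ascii(string $s): bool
{
    foreach (str_split($s) as $byte) {
        if (ord($byte) < 0x80) {
            return false;
        }
    }
    return true;
}

$word = "лет"; // three Cyrillic letters, encoded as UTF-8

echo strlen($word), "\n";              // 6 (bytes)
echo mb_strlen($word, 'UTF-8'), "\n";  // 3 (characters)
echo bin2hex($word), "\n";             // d0bbd0b5d182

// Each letter is stored as two bytes, all >= 0x80, so a check that looks at
// one byte at a time never sees an ASCII letter in this word.
var_dump(all_bytes_non_ascii($word));  // bool(true)
var_dump(all_bytes_non_ascii("PHP"));  // bool(false)
```

Whether isalpha() accepts any of these high bytes depends on the active locale, which is one reason the result is not portable.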

History

 [2012-12-03 03:14 UTC] aharvey@php.net
-Package: *Unicode Issues +Package: mbstring related
 [2016-06-30 10:47 UTC] cmb@php.net
It occurs to me that an mb_str_word_count() should not use
iswalpha(), because the latter depends on the current system
locale, whereas mbstring usually does not. That doesn't mean that a
locale-aware str_word_count() wouldn't be useful, but one could
easily implement an mb_str_word_count() in userland. Simplified
and non-optimized:

  function mb_str_word_count($string) {
      return count(mb_split('[\s_"]', $string));
  }
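
A split-based count can include empty pieces (for example with trailing whitespace) and treats digit runs as words. As an alternative sketch, not part of the original comment, one could match words directly with PCRE's Unicode letter class, which does not depend on the system locale. The function name utf8_word_count() is hypothetical, the input is assumed to be valid UTF-8, and the scalar type declarations assume PHP 7 or later. Like str_word_count(), it allows apostrophes and hyphens inside a word.

```php
<?php
// Hypothetical userland helper: counts runs that start with a Unicode letter
// and continue with letters, apostrophes or hyphens. Digits are not counted.
function utf8_word_count(string $string): int
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/\p{L}[\p{L}\'-]*/u', $string);

    // preg_match_all() returns false on failure, e.g. for invalid UTF-8.
    return $count === false ? 0 : $count;
}

echo utf8_word_count("Хабилло житель Яванского района."), "\n"; // 4
echo utf8_word_count("Ему 70 лет."), "\n";                      // 2
echo utf8_word_count("don't re-run"), "\n";                     // 2
```

Returning 0 for invalid input is a simplification; a stricter variant might throw an exception instead.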
 