Request #63671 Create a mb_str_word_count() function which is multi-byte aware
Submitted: 2012-12-03 03:09 UTC Modified: 2016-06-30 10:47 UTC
Votes:24
Avg. Score:4.5 ± 0.8
Reproduced:21 of 21 (100.0%)
Same Version:11 (52.4%)
Same OS:10 (47.6%)
From: kobrien at kiva dot org Assigned:
Status: Open Package: mbstring related
PHP Version: 5.5.0alpha1 OS: Ubuntu 12.04
Private report: No CVE-ID: None

 [2012-12-03 03:09 UTC] kobrien at kiva dot org
Description:
------------
Create an mb_str_word_count() function that properly counts the
number of words in a string containing multi-byte characters. This is currently
not possible with str_word_count() because it uses the isalpha() C function,
which does not properly handle multi-byte characters.
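
To illustrate the point above, here is a minimal sketch (not part of the original report) of what a byte-oriented check sees. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the sample word and variable names are only illustrative.

```php
<?php
// Illustrative only: inspect the raw bytes a byte-oriented classifier sees.
$word = "лет"; // three Cyrillic letters, six bytes in UTF-8

// Split into single bytes and convert each one to its numeric value.
$bytes = array_map('ord', str_split($word));
printf("characters: %d, bytes: %d\n", mb_strlen($word, 'UTF-8'), strlen($word));

// Keep only the bytes in the ASCII ranges A-Z / a-z, which are the only
// values isalpha() accepts in the "C" locale.
$asciiLetters = array_filter($bytes, function ($b) {
    return ($b >= 0x41 && $b <= 0x5A) || ($b >= 0x61 && $b <= 0x7A);
});
printf("ASCII letter bytes: %d\n", count($asciiLetters));
```

Each Cyrillic letter here is encoded as two bytes that are both 0x80 or higher, so a per-byte isalpha() test in the "C" locale never finds a letter. That is consistent with str_word_count() returning 0 for such input.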

As suggested by aharvey, this new function would replace usage of isalpha() with 
iswalpha(). 

A naive patch (written without real knowledge of this code, and untested) would
look like:

diff --git a/ext/standard/string.c b/ext/standard/string.c
index 7a4ae2e..9ab6b5f 100644
--- a/ext/standard/string.c
+++ b/ext/standard/string.c
@@ -5202,7 +5202,7 @@ PHP_FUNCTION(str_word_count)
 
        while (p < e) {
                s = p;
-               while (p < e && (isalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
+               while (p < e && (iswalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
                        p++;
                }
                if (p > s) {


Test script:
---------------
<?php
// existing str_word_count function for comparison
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n";
// returns 11
print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");
// returns 0

// new function mb_str_word_count
print mb_str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");
// returns 37

Expected result:
----------------
mb_str_word_count() returns the number of words in a string containing
multibyte characters.

Actual result:
--------------
Currently there is no mb_str_word_count() function. Using str_word_count() on a 
string with multibyte characters returns 0.

History

 [2012-12-03 03:14 UTC] aharvey@php.net
-Package: *Unicode Issues
+Package: mbstring related
 [2016-06-30 10:47 UTC] cmb@php.net
It occurs to me that an mb_str_word_count() should not use
iswalpha(), because the latter depends on the current system
locale, whereas mbstring usually doesn't. That doesn't mean that a
locale-aware str_word_count() wouldn't be useful, but one could
easily implement an mb_str_word_count() in userland. Simplified
and non-optimized:

  function mb_str_word_count($string) {
      return count(mb_split('[\s_"]', $string));
  }
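
As a slightly more defensive variant of the idea above, the following sketch counts runs of Unicode letters with PCRE instead of splitting on separators, so consecutive separators don't produce empty elements. It is only an illustration, not an mbstring API: the name mb_str_word_count_sketch() is hypothetical, PHP 7.0+ and UTF-8 input are assumed, and allowing apostrophes and hyphens inside a word only loosely mirrors str_word_count() rather than reproducing it exactly.

```php
<?php
// Hypothetical userland helper; not an existing PHP or mbstring function.
function mb_str_word_count_sketch(string $string): int
{
    // A word starts with a Unicode letter (\p{L}) and may continue with
    // letters, combining marks (\p{M}), apostrophes, or hyphens.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all("/\\p{L}[\\p{L}\\p{M}'-]*/u", $string);

    // preg_match_all() returns false on failure, e.g. for invalid UTF-8.
    return $count === false ? 0 : $count;
}

// prints 4 (the digits "70" are not letters, so they are not counted)
echo mb_str_word_count_sketch("Ему 70 лет. Он женат."), "\n";
// prints 5 ("non-latin" is one word because of the inner hyphen)
echo mb_str_word_count_sketch("does not handle non-latin characters"), "\n";
```

Because the pattern relies on Unicode character properties rather than on isalpha() or iswalpha(), the result should not depend on the system locale, which is the concern raised in the comment above.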
 