Doc Bug #63663 str_word_count does not properly handle non-latin characters
Submitted: 2012-12-01 02:29 UTC
Modified: 2021-03-31 13:33 UTC
Votes: 2
Avg. Score: 4.5 ± 0.5
Reproduced: 2 of 2 (100.0%)
Same Version: 1 (50.0%)
Same OS: 1 (50.0%)
From: kobrien at kiva dot org
Assigned: cmb
Status: Closed
Package: Strings related
PHP Version: 5.3.20-dev
OS: Ubuntu 12.04
Private report: No
CVE-ID: None
 [2012-12-01 02:29 UTC] kobrien at kiva dot org
Description:
------------
The function str_word_count() does not work properly on non-Latin characters: it 
returns zero for them. On Latin characters it works properly and returns the 
number of words in the string.

Test script:
---------------
<?php
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n";

// returns 11

print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");

// returns 0, but should return 37

Expected result:
----------------
The second instruction should return 37

Actual result:
--------------
The second instruction returns 0

 [2012-12-03 02:29 UTC] aharvey@php.net
This is due to the use of isalpha() internally, which doesn't play well with multibyte encodings like UTF-8, regardless of the locale setting.

Fundamentally, this is the same issue as bug #27668 — I'm not sure there's a lot we can do about this in PHP 5.x, but it's worth noting if and when we revisit Unicode string handling internally.
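
To illustrate the point above, here is a minimal sketch of what a byte-oriented check sees when it is handed UTF-8 Cyrillic text. It assumes the source file is saved as UTF-8; the sample word and the variable names are chosen only for illustration, and the ASCII range test is a stand-in for a per-byte letter check in the "C" locale, not a copy of PHP's internals.

```php
<?php
// Assumption: this file is saved as UTF-8.
$word = "лет"; // three Cyrillic letters

// strlen() counts bytes, not characters: each of these letters takes two bytes.
$byteCount = strlen($word);

// Hex dump of the raw bytes that a per-byte check would be handed.
$hex = bin2hex($word);

// Count bytes that fall in the ASCII letter ranges A-Z / a-z.
// Every byte of a multibyte UTF-8 sequence is >= 0x80, so none qualify.
$asciiLetterBytes = 0;
foreach (str_split($word) as $byte) {
    $code = ord($byte);
    if (($code >= 0x41 && $code <= 0x5A) || ($code >= 0x61 && $code <= 0x7A)) {
        $asciiLetterBytes++;
    }
}

echo "bytes: $byteCount\n";                     // bytes: 6
echo "hex: $hex\n";                             // hex: d0bbd0b5d182
echo "ASCII letter bytes: $asciiLetterBytes\n"; // ASCII letter bytes: 0
```

Since no single byte looks like a letter, a routine that classifies input one byte at a time finds no word characters in such a string.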
 [2012-12-03 02:29 UTC] aharvey@php.net
-Status: Open +Status: Analyzed
 [2012-12-03 02:36 UTC] kobrien at kiva dot org
Thanks for the reply. Given your comments about the problems, would it be helpful 
for me to also file a feature request for newer versions of PHP to have an 
mb_str_word_count() function which could properly handle this case? I haven't dug 
into the C code enough to understand why isalpha() fails on multibyte input, but I'd 
imagine there is an alternative available that handles multibyte 
characters properly. I could potentially even create a patch if pointed in the 
right direction.
 [2012-12-03 02:47 UTC] aharvey@php.net
Yeah, a feature request for mb_str_word_count() might be a good idea.

The isalpha() issue isn't really PHP specific: the underlying C function simply takes a single byte as its input, so it can't ascertain whether a multibyte character is actually alphabetic or not (since it only ever sees one byte of the sequence at a time). There's an iswalpha() function that would do the right thing, but PHP was written before it was widely available, and using it in str_word_count() alone would be inconsistent with the rest of the language: it's something we'd need to think about as part of making the whole language more multibyte-aware.
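
As a possible userland workaround in the meantime, one option is to count runs of Unicode letters with PCRE instead of relying on per-byte classification. The sketch below is only an illustration: utf8_word_count() is a hypothetical helper name, not a PHP built-in, and it assumes UTF-8 input and a PCRE build with Unicode property support. Its notion of a "word" (a letter followed by letters, combining marks, apostrophes or hyphens) is an approximation of str_word_count()'s default behaviour, not a reimplementation of it.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): counts words in a UTF-8 string.
 * A "word" here starts with a Unicode letter and may continue with letters,
 * combining marks, apostrophes and hyphens.
 */
function utf8_word_count(string $text): int
{
    // \p{L} = any Unicode letter, \p{M} = combining mark.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/\p{L}[\p{L}\p{M}\'-]*/u', $text);

    // preg_match_all() returns false on failure (for example, invalid UTF-8).
    return $count === false ? 0 : $count;
}

echo utf8_word_count("Он женат."), "\n";   // 2
echo utf8_word_count("Ему 70 лет."), "\n"; // 2 (digits are not letters)
```

Note that digits are not counted as words here, which matches how str_word_count() treats ASCII input by default, so the result may differ from a plain whitespace split.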
 [2012-12-03 03:10 UTC] kobrien at kiva dot org
Ok feature request filed here: https://bugs.php.net/bug.php?id=63671
First time doing that, so hopefully it's correctly filed.
 [2021-03-31 13:33 UTC] cmb@php.net
-Type: Bug +Type: Documentation Problem -Assigned To: +Assigned To: cmb
 [2021-03-31 13:33 UTC] cmb@php.net
> […] it's something we'd need to think about as part of making
> the whole language more multibyte-aware.

That's unlikely to happen, so I'm changing this to a doc bug.
 [2021-03-31 13:35 UTC] git@php.net
Automatic comment on behalf of cmb69
Revision: https://github.com/php/doc-en/commit/c73b00a6d7f799e3e5189a316efa06b7ef3c0fe6
Log: Fix #63663: str_word_count does not properly handle non-latin characters
 [2021-03-31 13:35 UTC] git@php.net
-Status: Analyzed +Status: Closed
 [2021-04-01 01:31 UTC] git@php.net
Automatic comment on behalf of mumumu
Revision: https://github.com/php/doc-ja/commit/73e020f525667d9deb151768580ef755c7626204
Log: Fix #63663: str_word_count does not properly handle non-latin characters
 [2021-04-15 21:36 UTC] git@php.net
Automatic comment on behalf of Girgias
Revision: https://github.com/php/doc-fr/commit/aa154ae70dc3d059d8f2c029053f7c5afd8116f3
Log: Fix #63663: str_word_count does not properly handle non-latin characters
 