Bug #77777 str_word_count function for Chinese text
Submitted: 2019-03-21 04:52 UTC
Modified: 2019-03-22 13:25 UTC
From: deng5765 at gmail dot com
Assigned:
Status: Open
Package: Strings related
PHP Version: 7.3.3
OS: CentOS and MacOS
Private report: No
CVE-ID: None

 [2019-03-21 04:52 UTC] deng5765 at gmail dot com
Description:
------------
str_word_count() behaves differently on CentOS and macOS for Chinese text.

It returns 0 for pure Chinese text on CentOS. I also tested PHP 7.1 and PHP 7.2 and saw the same issue.

I can't find any explanation for this in the documentation.

Test script:
---------------
<?php
$content = '我不是鱼测试';
echo 'PHP OS: ', PHP_OS, "\n";
echo 'PHP Version: ', PHP_VERSION, "\n";
echo 'Word Count:  ', str_word_count($content), "\n";


Expected result:
----------------
PHP OS: Darwin
PHP Version: 7.3.3
Word Count:  6

PHP OS: Linux
PHP Version: 7.3.3
Word Count:  6

Actual result:
--------------
PHP OS: Darwin
PHP Version: 7.3.3
Word Count:  6


PHP OS: Linux
PHP Version: 7.3.3
Word Count:  0

History
 [2019-03-21 05:01 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2019-03-21 05:01 UTC] requinix@php.net
> Counts the number of words inside string.

> For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters,
> which also may contain, but not start with "'" and "-" characters.

What is your locale set to?
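
Since the documented definition of a word is locale dependent, a count that does not consult the locale at all may be easier to reason about. The following is only a minimal sketch, assuming the goal is to count Han characters rather than linguistic words (Chinese word segmentation is a separate problem). The helper name count_han_chars is hypothetical; it relies on PCRE's Unicode script property \p{Han} with the /u modifier.

```php
<?php
// Hypothetical helper: count Han characters without consulting the locale.
// Assumes $text is valid UTF-8; preg_match_all() returns false on error.
function count_han_chars(string $text): int
{
    $n = preg_match_all('/\p{Han}/u', $text);
    return $n === false ? 0 : $n;
}

echo 'Han characters: ', count_han_chars('我不是鱼测试'), "\n"; // 6
```

Because the pattern matches on Unicode properties, the result should not change with LANG or LC_CTYPE, unlike str_word_count().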
 [2019-03-21 05:55 UTC] deng5765 at gmail dot com
Ah, sorry, didn't notice that part.

Mac:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=


CentOS:
LANG=en_US.utf-8
LC_CTYPE="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_PAPER="en_US.utf-8"
LC_NAME="en_US.utf-8"
LC_ADDRESS="en_US.utf-8"
LC_TELEPHONE="en_US.utf-8"
LC_MEASUREMENT="en_US.utf-8"
LC_IDENTIFICATION="en_US.utf-8"
LC_ALL=en_US.utf-8
 [2019-03-21 07:24 UTC] deng5765 at gmail dot com
-Status: Feedback +Status: Open
 [2019-03-21 07:24 UTC] deng5765 at gmail dot com
I set the locale explicitly:

<?php
$locale = setlocale(LC_ALL, 'zh_CN.UTF-8');
$content = '我不是鱼测试';
echo 'Locale: ', $locale, "\n";
echo 'PHP OS: ', PHP_OS, "\n";
echo 'PHP Version: ', PHP_VERSION, "\n";
echo 'Word Count:  ', str_word_count($content), "\n";


but I still get different values:

Locale: zh_CN.UTF-8
PHP OS: Linux
PHP Version: 7.3.3
Word Count:  0


Locale: zh_CN.UTF-8
PHP OS: Darwin
PHP Version: 7.3.3
Word Count:  6


Is this expected?
 [2019-03-22 10:09 UTC] cmb@php.net
Please try this on both systems:

  var_dump(ctype_alpha('我不是鱼测试'));
 [2019-03-22 13:25 UTC] deng5765 at gmail dot com
Both returned:

bool(false)
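
A false result for the whole string does not say which individual bytes are classified as alphabetic. As a rough diagnostic sketch, assuming the classification happens byte by byte through the C library, the hypothetical helper alpha_bytes below applies ctype_alpha() to each byte separately. Running it on both systems under the same locale could show whether they classify the bytes of the UTF-8 sequence differently. The 'C' locale is used here only to make the example deterministic.

```php
<?php
// Hypothetical helper: report which bytes of $s ctype_alpha() accepts
// under the current locale, keyed by byte offset, as hex values.
function alpha_bytes(string $s): array
{
    $hits = [];
    if ($s === '') {
        return $hits;
    }
    foreach (str_split($s) as $i => $byte) {
        if (ctype_alpha($byte)) {
            $hits[$i] = bin2hex($byte);
        }
    }
    return $hits;
}

setlocale(LC_ALL, 'C'); // swap in the locale under test, e.g. 'zh_CN.UTF-8'
var_dump(alpha_bytes('我不是鱼测试'));
```

If the two systems list different byte offsets for the same locale, the difference would lie in the platform's character classification rather than in the script itself.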
 