Bug #77777 str_word_count function for Chinese text
Submitted: 2019-03-21 04:52 UTC Modified: 2019-03-22 13:25 UTC
From: deng5765 at gmail dot com Assigned:
Status: Open Package: Strings related
PHP Version: 7.3.3 OS: CentOS and MacOS
Private report: No CVE-ID: None

 

 [2019-03-21 04:52 UTC] deng5765 at gmail dot com
Description:
------------
str_word_count() behaves differently on different operating systems (CentOS and macOS) for Chinese text.

It returns 0 for pure Chinese text on CentOS. I tested PHP 7.1 and PHP 7.2 and saw the same issue.

I can't find any clue about this in the documentation.

Test script:
---------------
<?php
$content = '我不是鱼测试';
echo 'PHP OS: ', PHP_OS, "\n";
echo 'PHP Version: ', PHP_VERSION, "\n";
echo 'Word Count:  ', str_word_count($content), "\n";


Expected result:
----------------
PHP OS: Darwin
PHP Version: 7.3.3
Word Count:  6

PHP OS: Linux
PHP Version: 7.3.3
Word Count:  6

Actual result:
--------------
PHP OS: Darwin
PHP Version: 7.3.3
Word Count:  6


PHP OS: Linux
PHP Version: 7.3.3
Word Count:  0

 [2019-03-21 05:01 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2019-03-21 05:01 UTC] requinix@php.net
> Counts the number of words inside string.

> For the purpose of this function, 'word' is defined as a locale dependent string containing alphabetic characters,
> which also may contain, but not start with "'" and "-" characters.

What is your locale set to?
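
To answer this from inside PHP rather than from the shell, the script can ask setlocale() for the value it is actually using. This is a minimal sketch; it relies on the documented behaviour that passing "0" as the locale returns the current setting without changing it. The variable name $ctype is only illustrative.

```php
<?php
// Passing '0' asks setlocale() to report the current LC_CTYPE setting
// without modifying it. The result may differ from what the shell's
// `locale` command prints, because PHP decides at startup which
// categories it inherits from the environment.
$ctype = setlocale(LC_CTYPE, '0');
echo 'LC_CTYPE: ', $ctype, "\n";
```

The value printed depends on the platform and PHP version, so no expected output is given here.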
 [2019-03-21 05:55 UTC] deng5765 at gmail dot com
Ah, sorry, didn't notice that part.

Mac:
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=


CentOS:
LANG=en_US.utf-8
LC_CTYPE="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME="en_US.utf-8"
LC_COLLATE="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_PAPER="en_US.utf-8"
LC_NAME="en_US.utf-8"
LC_ADDRESS="en_US.utf-8"
LC_TELEPHONE="en_US.utf-8"
LC_MEASUREMENT="en_US.utf-8"
LC_IDENTIFICATION="en_US.utf-8"
LC_ALL=en_US.utf-8
 [2019-03-21 07:24 UTC] deng5765 at gmail dot com
-Status: Feedback +Status: Open
 [2019-03-21 07:24 UTC] deng5765 at gmail dot com
I set the locale like this:

<?php
$locale = setlocale(LC_ALL, 'zh_CN.UTF-8');
$content = '我不是鱼测试';
echo 'Locale: ', $locale, "\n";
echo 'PHP OS: ', PHP_OS, "\n";
echo 'PHP Version: ', PHP_VERSION, "\n";
echo 'Word Count:  ', str_word_count($content), "\n";


but I still got different values:

Locale: zh_CN.UTF-8
PHP OS: Linux
PHP Version: 7.3.3
Word Count:  0


Locale: zh_CN.UTF-8
PHP OS: Darwin
PHP Version: 7.3.3
Word Count:  6


Is this normal?
 [2019-03-22 10:09 UTC] cmb@php.net
Please try on both architectures:

  var_dump(ctype_alpha('我不是鱼测试'));
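
A per-byte variant of the same check may be more informative, because ctype_alpha() on the whole string only says whether every byte is alphabetic. The sketch below assumes that str_word_count() classifies the input one byte at a time, in the same way ctype_alpha() does, and that the script file is saved as UTF-8. The output depends on the platform's locale data, so none is shown.

```php
<?php
$content = '我不是鱼测试';

// str_split() without a length argument yields single bytes, not
// characters; each UTF-8 encoded Han character here occupies 3 bytes.
$bytes = str_split($content);

foreach ($bytes as $byte) {
    // ctype_alpha() consults the current LC_CTYPE locale for each byte.
    printf("0x%02X %s\n", ord($byte), ctype_alpha($byte) ? 'alpha' : 'not alpha');
}
```

If the two systems disagree on any byte, that would account for the different word counts.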
 [2019-03-22 13:25 UTC] deng5765 at gmail dot com
Both got:

bool(false)
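
If the goal is the result of 6 from the expected output, which matches the number of characters rather than linguistic words, a locale-independent count is possible with a Unicode-aware regular expression. This is only a sketch of a workaround, not a fix for str_word_count(); it assumes the input is valid UTF-8 and that PCRE was built with Unicode property support, which is the usual case.

```php
<?php
$content = '我不是鱼测试';

// \p{Han} matches a single Han ideograph; the /u modifier makes PCRE
// treat pattern and subject as UTF-8 instead of raw bytes.
// preg_match_all() returns the number of matches found.
$count = preg_match_all('/\p{Han}/u', $content);

echo 'Han characters: ', $count, "\n"; // prints "Han characters: 6"
```

This counts characters only; splitting Chinese text into actual words needs a segmentation step that is outside the scope of this report.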
 