Bug #75269 mb_regex_encoding() does not support GB18030 or CP1251
Submitted: 2017-09-27 21:21 UTC Modified: 2017-09-28 11:01 UTC
From: gerthaubrich at web dot de Assigned:
Status: Open Package: mbstring related
PHP Version: Irrelevant OS: Debian 8 x64
Private report: No CVE-ID: None

 [2017-09-27 21:21 UTC] gerthaubrich at web dot de
Description:
------------
Is there a reason why mb_regex_encoding() (and the bundled Oniguruma library) does not support CP1251 and GB18030 in PHP? The general mbstring functions do support these encodings.

I have checked PHP 5.4.45 (Oniguruma 4.7.1), PHP 7.0.23 and 7.1.9 (Oniguruma 5.9.6), and now also 7.2.0 RC2 (Oniguruma 6.3.0). The function accepts the same list of encodings across all of these library versions.
According to the Oniguruma history, support for GB18030 was implemented in v3.8.4 and support for CP1251 (alias Windows-1251) was implemented in v5.2.0 of the library.

So I am wondering if this is a feature request or a bug in PHP?
How does the function check for a valid/supported encoding name in PHP? Is the request passed through to the library itself, or is there some kind of whitelist implemented in PHP before the function call (that hasn't been updated for quite some time)?

Bug #23470 (05/2003 !) mentions the option --enable-mbstring=all for CP1251, but as far as I remember this "all" value has not appeared in recent configure scripts (e.g. for the last 3 years or so). Does it still exist?

For example, the 7.2.0 RC2 source contains the files cp1251.c and gb18030.c in \ext\mbstring\oniguruma\src. Both files are referenced in config.w32 and in config.m4, and oniguruma.h defines ONIG_ENCODING_CP1251 and ONIG_ENCODING_GB18030.



One additional thought for PHP 7.2:
Maybe the team should consider bumping the Oniguruma version to at least 6.4.0, instead of 6.3.0. The history for 6.4.0 mentions fixed memory leaks and a fix for an "endless repeat" error, and only a few new features.

BR


 [2017-09-28 11:01 UTC] cmb@php.net
> How does the function check for a valid/supported encoding name in PHP? Is the
> request passed through to the library itself, or is there some kind of 
> whitelist implemented in PHP before the function call (that hasn't been
> updated for quite some time)?

From a quick glance, it looks like the latter, see
<https://lxr.room11.org/xref/php-src%40master/ext/mbstring/php_mbregex.c#enc_name_map>.
 [2017-09-28 16:38 UTC] gerthaubrich at web dot de
In this case it seems that the list starting at line 186, "php_mb_regex_enc_name_map_t enc_name_map[] = { ...", prohibits the use of GB18030 and CP1251 simply because these encodings are missing from it.
The list contains all the encodings that are currently accepted by mb_regex_encoding().
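
To illustrate the kind of whitelist lookup described above, here is a minimal, self-contained sketch. It is not the code from php_mbregex.c: the enum enc_id, the helper ascii_casecmp() and the function lookup_encoding() are hypothetical stand-ins, and case-insensitive matching is an assumption. Only the shape of the table (aliases separated by "\0" within one string, one entry per encoding) is taken from the enc_name_map excerpt in this report.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Oniguruma's encoding handles. */
typedef enum { ENC_UNDEF = 0, ENC_GB18030, ENC_CP1251, ENC_EUC_JP } enc_id;

typedef struct {
    const char *names; /* aliases separated by '\0'; an empty alias ends the list */
    enc_id code;
} enc_name_map_t;

/* Illustrative table; the alias strings for GB18030 and CP1251 follow the report. */
static const enc_name_map_t enc_name_map[] = {
    { "GB18030\0GB-18030\0GB-18030-2000\0", ENC_GB18030 },
    { "CP1251\0CP-1251\0WINDOWS-1251\0",    ENC_CP1251 },
    { "EUC-JP\0EUCJP\0",                    ENC_EUC_JP },
    { NULL, ENC_UNDEF }
};

/* Portable ASCII case-insensitive comparison (assumed matching rule). */
static int ascii_casecmp(const char *a, const char *b)
{
    while (*a && tolower((unsigned char)*a) == tolower((unsigned char)*b)) {
        a++;
        b++;
    }
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}

/* Walk every entry and every alias; a name that is not listed is rejected. */
static enc_id lookup_encoding(const char *name)
{
    if (name == NULL || *name == '\0') {
        return ENC_UNDEF;
    }
    for (const enc_name_map_t *m = enc_name_map; m->names != NULL; m++) {
        for (const char *p = m->names; *p != '\0'; p += strlen(p) + 1) {
            if (ascii_casecmp(p, name) == 0) {
                return m->code;
            }
        }
    }
    return ENC_UNDEF;
}
```

With a table like this, a name is accepted only if it appears as an alias, regardless of what the underlying library supports: lookup_encoding("windows-1251") finds the CP1251 entry, while a name with no entry, such as "KOI8-R" in this sketch, yields ENC_UNDEF.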

Interestingly, the list includes some aliases that are not returned by mb_encoding_aliases(). For example, "BIG5" is not returned there, yet it is accepted by both mb_internal_encoding() and mb_regex_encoding(). On the other hand, "ISO8859-1" is accepted by mb_regex_encoding() but not by mb_internal_encoding(). For ASCII, "ISO646" is accepted by mb_regex_encoding() but not by mb_internal_encoding(); the latter only accepts "ISO646-US", which in turn is not accepted by mb_regex_encoding(). What a mess ... ;-)
 [2017-09-28 21:05 UTC] gerthaubrich at web dot de
I've made a simple test via modifying php_mbregex.c:

php_mb_regex_enc_name_map_t enc_name_map[] = {
#ifdef ONIG_ENCODING_GB18030
	{
		"GB18030\0GB-18030\0GB-18030-2000\0",
		ONIG_ENCODING_GB18030
	},
#endif
#ifdef ONIG_ENCODING_CP1251
	{
		"CP1251\0CP-1251\0WINDOWS-1251\0",
		ONIG_ENCODING_CP1251
	},
#endif
#ifdef ONIG_ENCODING_EUC_JP
	{
[ ... ]

and recompiled 7.2.0RC2. As expected, the unit test mb_regex_encoding_variation2.phpt now fails, but mb_regex_encoding() does accept these two encodings (and their aliases). A preliminary test with CP1251 worked with a pattern like '\p{Alpha}{3,6}', but this might not be a trustworthy result, because CP1251 is a simple 8-bit charset.
For GB18030 I do not have the language skills to test it myself or even provide a useful RegEx.
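
One reason a CP1251 result says little about GB18030 is that GB18030 is a variable-width encoding in which a character occupies 1, 2 or 4 bytes. The following is a minimal sketch, based on the byte ranges of the GB18030 standard and not on Oniguruma's gb18030.c, of how the byte length of a single character can be determined. The function name gb18030_char_len() is hypothetical.

```c
#include <stddef.h>

/*
 * Return the length in bytes (1, 2 or 4) of the GB18030 character that
 * starts at s, or 0 if the bytes are truncated or not a valid sequence.
 * avail is the number of bytes readable from s.
 */
static size_t gb18030_char_len(const unsigned char *s, size_t avail)
{
    if (avail == 0) {
        return 0;
    }
    if (s[0] <= 0x7F) {
        return 1;                       /* single byte (ASCII range) */
    }
    if (s[0] < 0x81 || s[0] > 0xFE) {
        return 0;                       /* 0x80 and 0xFF are not lead bytes */
    }
    if (avail < 2) {
        return 0;                       /* lead byte without a second byte */
    }
    if ((s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE)) {
        return 2;                       /* two-byte form */
    }
    if (s[1] >= 0x30 && s[1] <= 0x39) {
        /* four-byte form: lead, digit, 0x81..0xFE, digit */
        if (avail >= 4 &&
            s[2] >= 0x81 && s[2] <= 0xFE &&
            s[3] >= 0x30 && s[3] <= 0x39) {
            return 4;
        }
    }
    return 0;
}
```

A meaningful GB18030 test of the regex engine would therefore include subjects with two-byte characters (for example the bytes 0xD6 0xD0, the character 中) and four-byte sequences, so that character boundaries are exercised rather than only single bytes.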

(The macros ONIG_ENCODING_GB18030 and ONIG_ENCODING_CP1251 are already defined in oniguruma.h, at lines 233 and 235.)
 
Last updated: Sat May 25 21:01:27 2019 UTC