  [2017-09-27 21:21 UTC] gerthaubrich at web dot de
Description:
------------
Is there a reason why mb_regex_encoding() (and the Oniguruma library) does not support CP1251 and GB18030 in PHP? (The general mbstring functions do support these encodings.)

I have checked PHP 5.4.45 (Oniguruma 4.7.1) up to PHP 7.0.23 and 7.1.9 (5.9.6) and now also 7.2.0 RC2 (6.3.0), and the function accepts the same list of encodings across all these library versions. According to the Oniguruma history, support for GB18030 was implemented in v3.8.4 and support for CP1251 (alias Windows-1251) in v5.2.0 of the library.

So I am wondering whether this is a feature request or a bug in PHP. How does the function check for a valid/supported encoding name in PHP? Is the request passed through to the library itself, or is there some kind of whitelist implemented in PHP before the function call (one that hasn't been updated for quite some time)?

Bug #23470 (05/2003!) mentions the option --enable-mbstring=all for CP1251, but as far as I remember this "all" value has not been mentioned in the configure scripts for the last 3 years or so. Does it still exist?

For example, the 7.2.0 RC2 source contains the files cp1251.c and gb18030.c in \ext\mbstring\oniguruma\src. Both files are referenced in config.w32 and in config.m4, and oniguruma.h defines ONIG_ENCODING_CP1251 and ONIG_ENCODING_GB18030.

One additional thought for PHP 7.2: maybe the team should consider bumping the Oniguruma version to at least 6.4.0 instead of 6.3.0. The history for 6.4.0 mentions fixed memory leaks and a fix for an "endless repeat" error, and only a few new features.

BR
|  Copyright © 2001-2025 The PHP Group All rights reserved. | Last updated: Fri Oct 31 05:00:02 2025 UTC | 
In this case it seems that the list starting at line 186, "php_mb_regex_enc_name_map_t enc_name_map[] = { ...", prohibits the use of GB18030 and CP1251 simply because these encodings are missing from it. The list contains all the encodings that are currently accepted by mb_regex_encoding().

Interestingly, the list includes some aliases that are not returned by mb_encoding_aliases(). For example, "BIG5" is missing from the mb_encoding_aliases() output, yet it is accepted by both mb_internal_encoding() and mb_regex_encoding(). On the other hand, "ISO8859-1" is accepted by mb_regex_encoding() but not by mb_internal_encoding(). For ASCII, "ISO646" is accepted by mb_regex_encoding() but not by mb_internal_encoding(); the latter only accepts "ISO646-US", which in turn is not accepted by mb_regex_encoding(). What a mess ... ;-)

I've made a simple test by modifying php_mbregex.c:

    php_mb_regex_enc_name_map_t enc_name_map[] = {
    #ifdef ONIG_ENCODING_GB18030
        { "GB18030\0GB-18030\0GB-18030-2000\0", ONIG_ENCODING_GB18030 },
    #endif
    #ifdef ONIG_ENCODING_CP1251
        { "CP1251\0CP-1251\0WINDOWS-1251\0", ONIG_ENCODING_CP1251 },
    #endif
    #ifdef ONIG_ENCODING_EUC_JP
        {
    [ ... ]

and recompiled 7.2.0RC2. Of course, I got a unit test error in mb_regex_encoding_variation2.phpt, but mb_regex_encoding() now accepts these two encodings (and their aliases). A preliminary test with CP1251 worked with something like '\p{Alpha}{3,6}', but this might not be a trustworthy result, because CP1251 is a simple 8-bit charset. For GB18030 I do not have the language skills to test it myself or even provide a useful RegEx.

(The macros ONIG_ENCODING_GB18030 and ONIG_ENCODING_CP1251 are already defined in oniguruma.h, lines 233 and 235.)
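
To illustrate why a missing table entry is enough to reject an encoding name, here is a minimal, standalone sketch of a lookup over NUL-separated alias lists like the ones above. It is not the code from php_mbregex.c: the enum values, the type name enc_name_map_t, the function find_encoding() and the case-insensitive comparison are all assumptions made for illustration, and the real table stores ONIG_ENCODING_* values from oniguruma.h instead of an enum.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strcasecmp (POSIX); assumed available */

/* Hypothetical stand-ins for the ONIG_ENCODING_* values. */
typedef enum { ENC_NONE = 0, ENC_GB18030, ENC_CP1251, ENC_EUC_JP } enc_id;

typedef struct {
    /* Aliases separated by '\0'. The string literal's implicit
       terminating '\0' yields an empty string that ends the list. */
    const char *names;
    enc_id code;
} enc_name_map_t;

static const enc_name_map_t enc_name_map[] = {
    { "GB18030\0GB-18030\0GB-18030-2000\0", ENC_GB18030 },
    { "CP1251\0CP-1251\0WINDOWS-1251\0", ENC_CP1251 },
    { "EUC-JP\0EUCJP\0", ENC_EUC_JP },
    { NULL, ENC_NONE } /* sentinel entry */
};

/* Returns the encoding id for a name or alias, or ENC_NONE if no
   table entry lists it. Comparison is case-insensitive here, which
   is an assumption of this sketch. */
static enc_id find_encoding(const char *name)
{
    const enc_name_map_t *m;

    for (m = enc_name_map; m->names != NULL; m++) {
        const char *p;
        /* Step from alias to alias until the empty string is reached. */
        for (p = m->names; *p != '\0'; p += strlen(p) + 1) {
            if (strcasecmp(p, name) == 0) {
                return m->code;
            }
        }
    }
    return ENC_NONE;
}
```

With such a table, find_encoding("windows-1251") only succeeds because the CP1251 row exists. A name the underlying library supports but that has no row, such as "KOI8-R" in this sketch, falls through to ENC_NONE, which matches the whitelist behaviour described above.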