Bug #77093 mb_ereg_replace() does not work with SJIS-win(cp932)
Submitted: 2018-11-02 01:32 UTC Modified: 2018-11-05 01:37 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:1 (50.0%)
Same OS:2 (100.0%)
From: ryosuke dot kobayashi at fujisystems dot co dot jp Assigned:
Status: Re-Opened Package: mbstring related
PHP Version: 7.2.11 OS: Linux
Private report: No CVE-ID: None

 [2018-11-02 01:32 UTC] ryosuke dot kobayashi at fujisystems dot co dot jp
Description:
------------
mb_ereg_replace() returns null, without emitting any error, when the given string contains certain 'SJIS-win' characters (e.g. 'ⅰ', '伃'...).
I believe the characters between 0xFA40 ('ⅰ') and 0xFC4B ('黑') trigger this bug.

It also happens with mb_ereg_match*().

I confirmed that this happens with PHP 7.1 and higher. Here are the results of my tests.

PHP 7.0.32 with oniguruma 5.9.6  => It works.
PHP 7.0.32 with oniguruma 6.3.0  => It works.
PHP 7.1.23 with oniguruma 5.9.6  => It does not work.
PHP 7.2.11 with oniguruma 6.3.0  => It does not work.



Test script:
---------------
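<?php
// Match every valid SJIS-win byte sequence in [$a, $b) against itself.
function chk($a, $b) {
    mb_internal_encoding('SJIS-win');
    $j = 0;
    for ($i = $a; $i < $b; $i++) {
        $s = sprintf('%x', $i);
        $hex = hex2bin($s);
        if (mb_check_encoding($hex)) {
            // A character is expected to match itself; report it if it does not.
            if (!mb_ereg($hex, $hex)) {
                echo "$s($hex):NG\n";
            }
        }
        $j++;
    }
    echo "cnt:$j\n";
}
chk(0xED40, 0xFC51);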



 [2018-11-02 07:53 UTC] yohgaki@php.net
-Status: Open +Status: Not a bug
 [2018-11-02 07:53 UTC] yohgaki@php.net
You need to use mb_regex_encoding() to specify the mb_regex encoding.
https://3v4l.org/SYRJP
 [2018-11-02 08:06 UTC] yohgaki@php.net
Mbstring's internal encoding and regex encoding are independent.

Before 5.6/7.0, mbregex's default encoding was EUC-JP, which is ISO 8859-1 compatible, so it worked in most cases. Since 5.6/7.0, I changed the default to UTF-8.
https://wiki.php.net/rfc/default_encoding

Anyway, to ensure correct operation, you need to set the correct encoding via mb_regex_encoding() for mb_ereg*(). I might have to look into where the difference came from, though.

If you notice anything wrong, please let us know.
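
To illustrate the point above, here is a minimal sketch that sets both encodings explicitly, assuming the script's data is in SJIS-win. The helper name use_sjis_win() is hypothetical and not part of mbstring; only mb_internal_encoding() and mb_regex_encoding() are real functions.

```php
<?php
// Hypothetical helper: sets both encodings, because mb_internal_encoding()
// alone does not change the encoding used by mb_ereg*().
function use_sjis_win(): bool
{
    $internalOk = mb_internal_encoding('SJIS-win');
    $regexOk = mb_regex_encoding('SJIS-win');

    // Both setters return true on success.
    return $internalOk && $regexOk;
}

var_dump(use_sjis_win());
```

Calling both setters once at startup keeps later mb_ereg*() calls from silently using the default regex encoding.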
 [2018-11-02 09:17 UTC] ryosuke dot kobayashi at fujisystems dot co dot jp
Thank you for your reply.

>You need to use mb_regex_encoding() to specify the mb_regex encoding.
Exactly... I forgot to add this because my environment runs on SJIS-win.

I fixed my test code, and tried it here.
https://3v4l.org/m6JZe

Now the problem I wanted to point out occurs.

Best regards.
 [2018-11-02 09:54 UTC] cmb@php.net
-Status: Not a bug +Status: Re-Opened
 [2018-11-02 09:54 UTC] cmb@php.net
<https://3v4l.org/m6JZe> looks like a bug.
 [2018-11-03 03:34 UTC] yohgaki@php.net
It seems encoding validation is failing somehow and returning FALSE for it.
 [2018-11-03 23:16 UTC] yohgaki@php.net
I briefly checked the code. It seems the difference comes from the encodings supported by mbstring and Oniguruma. Mbstring has a 'SJIS-win' encoding, while Oniguruma has only 'SJIS'. All SJIS variants are validated as 'SJIS'.

As a result, the current (newer) code tries to validate 'SJIS-win' as 'SJIS', which fails in certain cases.

The following code should be fixed to address this bug, i.e. php_mb_check_encoding() needs 'SJIS-win' from '_php_mb_regex_mbctype2name(MBREX(current_mbctype))' in this case, not 'SJIS'.

php_mbregex.c
	if (!php_mb_check_encoding(
	string,
	string_len,
	_php_mb_regex_mbctype2name(MBREX(current_mbctype))
	)) {

Using 'SJIS' as the mbregex encoding doesn't fix the issue.
https://3v4l.org/P56Zg
There must be another issue.
 [2018-11-03 23:22 UTC] yohgaki@php.net
I suppose returning the exact encoding, i.e. SJIS-win, from _php_mb_regex_mbctype2name(MBREX(current_mbctype)) would fix this bug.
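
The mismatch described above can be observed from userland with a rough sketch like the following, which validates the same bytes under both encoding names. The helper check_under() is hypothetical. The result for 'SJIS' may differ between PHP versions, so no output is claimed for it here.

```php
<?php
// Hypothetical helper: validates the same bytes under several encoding names.
function check_under(string $bytes, array $encodings): array
{
    $results = [];
    foreach ($encodings as $enc) {
        $results[$enc] = mb_check_encoding($bytes, $enc);
    }
    return $results;
}

// 0xFA40 is the start of the byte range named in the report.
var_dump(check_under("\xFA\x40", ['SJIS', 'SJIS-win']));
```

If the two entries differ, validating SJIS-win input under the name 'SJIS' rejects it, which is the situation the comment describes.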
 [2018-11-05 01:13 UTC] ryosuke dot kobayashi at fujisystems dot co dot jp
Thanks for the investigation.

I understand the reason now.

So, do these results come from the same cause?
https://3v4l.org/BiA4b
 [2018-11-05 01:37 UTC] yohgaki@php.net
I suppose it would be fixed as well, since mb_ereg_replace() (and its variants) returns NULL for invalid encoding.
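
Since the failure is silent, calling code may want to check the return value. The following is a minimal sketch of such a check, not a fix for the bug itself. The wrapper name checked_mb_ereg_replace() is hypothetical.

```php
<?php
// Hypothetical wrapper: turns a silent NULL/false return into an exception.
function checked_mb_ereg_replace(string $pattern, string $replacement, string $subject): string
{
    $result = mb_ereg_replace($pattern, $replacement, $subject);
    if (!is_string($result)) {
        throw new RuntimeException(
            'mb_ereg_replace() failed; check mb_regex_encoding() and the input bytes'
        );
    }
    return $result;
}

// Usage with plain ASCII input under UTF-8.
mb_regex_encoding('UTF-8');
echo checked_mb_ereg_replace('b', 'x', 'abc'), "\n";
```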
 