Bug #77093 mb_ereg_replace() does not work with SJIS-win(cp932)
Submitted: 2018-11-02 01:32 UTC Modified: 2018-11-05 01:37 UTC
Votes:2
Avg. Score:4.5 ± 0.5
Reproduced:2 of 2 (100.0%)
Same Version:1 (50.0%)
Same OS:2 (100.0%)
From: ryosuke dot kobayashi at fujisystems dot co dot jp Assigned:
Status: Re-Opened Package: mbstring related
PHP Version: 7.2.11 OS: Linux
Private report: No CVE-ID: None
 [2018-11-02 01:32 UTC] ryosuke dot kobayashi at fujisystems dot co dot jp
Description:
------------
mb_ereg_replace() just returns NULL when the given string contains certain 'SJIS-win' characters (e.g. 'ⅰ', '伃'...), and it does so without emitting any error.
Characters between 0xFA40 ('ⅰ') and 0xFC4B ('黑') seem to trigger this bug, imo.

It also happens with mb_ereg_match*().

I confirmed that this happens with PHP 7.1 or higher. Here are the results of my tests.

PHP 7.0.32 with oniguruma 5.9.6  => It works.
PHP 7.0.32 with oniguruma 6.3.0  => It works.
PHP 7.1.23 with oniguruma 5.9.6  => It does not work.
PHP 7.2.11 with oniguruma 6.3.0  => It does not work.



Test script:
---------------
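<?php
// Try every code point in [$a, $b) under SJIS-win and report those that mb_ereg() fails to match.
function chk($a, $b) {
    mb_internal_encoding('SJIS-win');
    $j = 0; // number of code points tried
    for ($i = $a; $i < $b; $i++) {
        $s = sprintf('%x', $i);
        $hex = hex2bin($s); // raw bytes of the candidate character
        if (mb_check_encoding($hex)) {
            // A valid character used as a pattern should match itself.
            if (!mb_ereg($hex, $hex)) {
                echo "$s($hex):NG\n";
            }
        }
        $j++;
    }
    echo "cnt:$j\n";
}
chk(0xED40, 0xFC51);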



 [2018-11-02 07:53 UTC] yohgaki@php.net
-Status: Open +Status: Not a bug
 [2018-11-02 07:53 UTC] yohgaki@php.net
You need to use mb_regex_encoding() to specify the mb_regex encoding.
https://3v4l.org/SYRJP
 [2018-11-02 08:06 UTC] yohgaki@php.net
Mbstring's internal encoding and regex encoding are independent of each other.

Before 5.6/7.0, mbregex's default encoding was EUC-JP, which is ISO 8859-1 compatible, so it worked in most cases. Since 5.6/7.0, I made UTF-8 the default.
https://wiki.php.net/rfc/default_encoding

Anyway, to ensure correct operation, you'll need to set the correct encoding via mb_regex_encoding() for mb_ereg*(). I might have to take a look at where the difference came from, though.

If you notice anything wrong, please let us know.
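
A minimal sketch of the advice above, assuming mbstring is built with mbregex support. The helper name set_mb_encodings() is hypothetical and not part of mbstring; it only illustrates that the two settings are stored separately and must both be set.

```php
<?php
// Hypothetical helper: keep the mbregex encoding in step with the internal encoding.
function set_mb_encodings(string $encoding): array
{
    mb_internal_encoding($encoding); // used by mb_strlen(), mb_substr(), ...
    mb_regex_encoding($encoding);    // used by mb_ereg*(); a separate setting

    // Report what each setting now holds. The two names may differ,
    // because mbregex may report its own name for the encoding.
    return [mb_internal_encoding(), mb_regex_encoding()];
}

[$internal, $regex] = set_mb_encodings('SJIS-win');
echo "internal: $internal, regex: $regex\n";
```

Calling the helper once at startup avoids relying on the mbregex default, which is not necessarily the same as the internal encoding.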
 [2018-11-02 09:17 UTC] ryosuke dot kobayashi at fujisystems dot co dot jp
Thank you for your reply.

>You need to use mb_regex_encoding() to specify the mb_regex encoding.
Exactly... I forgot to add this because my environment runs on SJIS-win.

I fixed my test code, and tried it here.
https://3v4l.org/m6JZe

Now the problem I wanted to point out occurs.

Best regards.
 [2018-11-02 09:54 UTC] cmb@php.net
-Status: Not a bug +Status: Re-Opened
 [2018-11-02 09:54 UTC] cmb@php.net
<https://3v4l.org/m6JZe> looks like a bug.
 [2018-11-03 03:34 UTC] yohgaki@php.net
It seems encoding validation is failing somehow and returning FALSE for it.
 [2018-11-03 23:16 UTC] yohgaki@php.net
I briefly checked the code. It seems the difference comes from the encodings supported by mbstring and Oniguruma. Mbstring has an 'SJIS-win' encoding while Oniguruma has only 'SJIS'. All SJIS variants are validated as 'SJIS'.

As a result, the current (newer) code tries to validate 'SJIS-win' strings as 'SJIS', which fails in certain cases.

The following code should be fixed to address this bug, i.e. php_mb_check_encoding() needs 'SJIS-win' from '_php_mb_regex_mbctype2name(MBREX(current_mbctype))' in this case, not 'SJIS'.

php_mbregex.c
	if (!php_mb_check_encoding(
	string,
	string_len,
	_php_mb_regex_mbctype2name(MBREX(current_mbctype))
	)) {

Using 'SJIS' as the mbregex encoding doesn't fix the issue.
https://3v4l.org/P56Zg
There must be another issue as well.
 [2018-11-03 23:22 UTC] yohgaki@php.net
I suppose returning the exact encoding, i.e. SJIS-win, from _php_mb_regex_mbctype2name(MBREX(current_mbctype)) would fix this bug.
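
The mismatch described above can be probed from userland with a small sketch like the following. It assumes that mb_check_encoding() with an explicit encoding name reflects the same validation the regex code performs; check_under() is a hypothetical helper, and the result for 'SJIS' may depend on the PHP version.

```php
<?php
// 0xFA40 is 'ⅰ' in SJIS-win, per the original report.
$bytes = "\xFA\x40";

// Hypothetical helper: validate the same bytes under a given encoding name.
function check_under(string $bytes, string $encoding): bool
{
    return mb_check_encoding($bytes, $encoding);
}

// Compare the name mbstring uses with the name mbregex falls back to.
foreach (['SJIS-win', 'SJIS'] as $encoding) {
    printf("%-8s => %s\n", $encoding, var_export(check_under($bytes, $encoding), true));
}
```

If the two lines disagree, a string that is valid for mb_internal_encoding('SJIS-win') would still be rejected by a check that uses the 'SJIS' name.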
 [2018-11-05 01:13 UTC] ryosuke dot kobayashi at fujisystems dot co dot jp
Thanks for the investigation.

I understand the reason now.

So, do these results come from the same cause?
https://3v4l.org/BiA4b
 [2018-11-05 01:37 UTC] yohgaki@php.net
I suppose that would be fixed as well, since mb_ereg_replace() (and its variants) returns NULL for an invalid encoding.
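
Because the NULL return is silent, callers may want to check for it explicitly. A minimal sketch, assuming mbregex is available; replace_or_fail() is a hypothetical wrapper, not an mbstring function.

```php
<?php
// Hypothetical wrapper: turn the silent NULL/FALSE return into an exception.
function replace_or_fail(string $pattern, string $replacement, string $subject): string
{
    $result = mb_ereg_replace($pattern, $replacement, $subject);
    if ($result === null || $result === false) {
        // NULL: the subject failed encoding validation; FALSE: another error.
        throw new RuntimeException(
            'mb_ereg_replace() failed under regex encoding ' . mb_regex_encoding()
        );
    }
    return $result;
}

mb_regex_encoding('UTF-8');
echo replace_or_fail('b', 'x', 'abc'), "\n"; // prints "axc"
```

With a wrapper like this, the affected SJIS-win characters would surface as an exception instead of an unexpected NULL.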
 
Last updated: Mon Nov 25 10:01:32 2024 UTC