Bug #41147 mb_check_encoding fails to check invalid string
Submitted: 2007-04-20 09:21 UTC
Modified: 2007-09-24 12:31 UTC
From: teracci2002 at yahoo dot co dot jp
Assigned: hirokawa
Status: Closed
Package: mbstring related
PHP Version: 5.2.1
OS: Linux
Private report: No
CVE-ID: None
 [2007-04-20 09:21 UTC] teracci2002 at yahoo dot co dot jp
Description:
------------
mb_check_encoding returns true when certain invalid EUC-JP / Shift_JIS / UTF-8 byte sequences are supplied.


Reproduce code:
---------------
//(1)
var_dump(mb_check_encoding("\x00\xA1", "EUC-JP"));
//(2)
var_dump(mb_check_encoding("\x00\x81", "Shift_JIS"));
//(3)
var_dump(mb_check_encoding("\x00\xE3", "UTF-8"));

Expected result:
----------------
//(1)
bool(false)
//(2)
bool(false)
//(3)
bool(false)

Actual result:
--------------
//(1)
bool(true)
//(2)
bool(true)
//(3)
bool(true)


 [2007-04-20 09:27 UTC] tony2001@php.net
Please explain why you think it should succeed if the data is invalid.
 [2007-04-20 10:11 UTC] teracci2002 at yahoo dot co dot jp
If the data is NOT valid, FALSE should be returned, I guess.
But actually it returns TRUE.
Am I wrong or missing your point?
 [2007-08-18 14:45 UTC] hirokawa@php.net
This is expected behavior, because
0x00,0xa1 is a null byte + a valid ShiftJIS character.
mb_check_encoding should be used to detect invalid
or corrupted multibyte characters.

 [2007-08-18 16:00 UTC] teracci2002 at yahoo dot co dot jp
Just read the bug report again.

No one says 0x00,0xa1 is invalid character in ShiftJIS.
 [2007-08-19 20:10 UTC] jani@php.net
Someone disagrees, Rui.. :)
 [2007-09-04 14:30 UTC] hirokawa@php.net
> No one says 0x00,0xa1 is invalid character in ShiftJIS.
I didn't say that.

0x00+0xa1 is valid byte sequence in Shift_JIS sequence.
A character in Shift_JIS encoding is encoded in either single byte 
or double byte.
In this case, the byte stream is recognized as two characters:
a null byte and a half-width katakana punctuation mark (0xa1).
 
see: http://hp.vector.co.jp/authors/VA013241/misc/shiftjis.html


 [2007-09-04 14:55 UTC] teracci2002 at yahoo dot co dot jp
> 0x00+0xa1 is valid byte sequence in Shift_JIS sequence.

I know that.
But 0x00+0x81 is an invalid sequence in Shift_JIS.
So why does the statement below return "bool(true)"?

var_dump(mb_check_encoding("\x00\x81", "Shift_JIS"));

Read bug report again, please.
 [2007-09-04 22:38 UTC] jani@php.net
Did you read it Rui? (why do your reports end up as 'Analyzed' all the time? :)
 [2007-09-16 08:56 UTC] hirokawa@php.net
Sorry for the delayed response.

0x00,0x81 is also a valid byte sequence in Shift_JIS
because 0x81 is a valid first byte of a double-byte
JIS X 0208 character.

See: http://en.wikipedia.org/wiki/Shift_jis

We cannot decide whether the byte stream is valid or
invalid because the last byte of the byte stream (0x81)
is a valid first byte of a double-byte character.
In this case, true (valid) will be returned.

A byte stream containing a valid first byte followed by
an invalid second byte returns false.

For example,

var_dump(mb_check_encoding("\x81\x00", "Shift_JIS"));

returns false (invalid).

This is because 0x81 is a valid first byte of a double-byte
JIS X 0208 character, but 0x00 is an invalid second byte of
a double-byte JIS X 0208 character.
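
The three Shift_JIS cases discussed so far can be told apart with a purely structural scan. The following is a minimal sketch, not the mbstring implementation: sjis_structure() is a hypothetical helper, and the byte ranges are an assumption (lead bytes 0x81-0x9F and 0xE0-0xFC, which includes vendor extension rows; trail bytes 0x40-0x7E and 0x80-0xFC). It does not check whether a double-byte code is actually assigned in JIS X 0208.

```php
<?php
// Hypothetical helper (not part of mbstring): classifies a byte string by
// Shift_JIS structure only. Byte ranges are assumptions stated above.
function sjis_structure(string $bytes): string
{
    $n = strlen($bytes);
    for ($i = 0; $i < $n; $i++) {
        $b = ord($bytes[$i]);
        // Single-byte character: ASCII range or half-width katakana.
        if ($b <= 0x7F || ($b >= 0xA1 && $b <= 0xDF)) {
            continue;
        }
        // Otherwise the byte must be the first byte of a double-byte character.
        $isLead = ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xFC);
        if (!$isLead) {
            return 'invalid';
        }
        // A lead byte at the very end of the input has no second byte.
        if ($i + 1 >= $n) {
            return 'truncated';
        }
        $t = ord($bytes[++$i]);
        $isTrail = ($t >= 0x40 && $t <= 0x7E) || ($t >= 0x80 && $t <= 0xFC);
        if (!$isTrail) {
            return 'invalid';
        }
    }
    return 'complete';
}

var_dump(sjis_structure("\x00\xA1")); // string(8) "complete"
var_dump(sjis_structure("\x00\x81")); // string(9) "truncated"
var_dump(sjis_structure("\x81\x00")); // string(7) "invalid"
```

Reporting "truncated" as its own result keeps the choice explicit: a byte-stream check may accept it, while a whole-string check would treat it as invalid.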

Likewise, 0x00, 0xe3 in UTF-8 is also a
valid byte sequence (a null byte + the first byte of
a three-byte UTF-8 character).

See: http://en.wikipedia.org/wiki/UTF-8
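
The difference between checking a "byte stream" and checking a complete "string" can be made concrete for UTF-8. The sketch below is an illustration under assumptions, not the mbstring code: utf8_well_formed() and its $allowTruncatedTail flag are hypothetical names, and the byte ranges follow the well-formed UTF-8 table in the Unicode Standard. With the flag off, an incomplete character at the end of the input is rejected; with the flag on, it is tolerated as the start of a character that might continue in later data.

```php
<?php
// Hypothetical helper (not part of mbstring): structural UTF-8 check.
// $allowTruncatedTail = true  -> "byte stream" view (incomplete tail accepted)
// $allowTruncatedTail = false -> "string" view (incomplete tail rejected)
function utf8_well_formed(string $bytes, bool $allowTruncatedTail = false): bool
{
    $n = strlen($bytes);
    $i = 0;
    while ($i < $n) {
        $b = ord($bytes[$i]);
        if ($b <= 0x7F) { // single-byte (ASCII), including the null byte
            $i++;
            continue;
        }
        // Sequence length and the allowed range of the SECOND byte.
        if ($b >= 0xC2 && $b <= 0xDF)     { $len = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xE0)              { $len = 3; $lo = 0xA0; $hi = 0xBF; }
        elseif ($b === 0xED)              { $len = 3; $lo = 0x80; $hi = 0x9F; }
        elseif ($b >= 0xE1 && $b <= 0xEF) { $len = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF0)              { $len = 4; $lo = 0x90; $hi = 0xBF; }
        elseif ($b >= 0xF1 && $b <= 0xF3) { $len = 4; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF4)              { $len = 4; $lo = 0x80; $hi = 0x8F; }
        else { return false; } // 0x80-0xC1 and 0xF5-0xFF cannot start a character

        for ($k = 1; $k < $len; $k++) {
            if ($i + $k >= $n) {
                // Input ended in the middle of a character.
                return $allowTruncatedTail;
            }
            $c = ord($bytes[$i + $k]);
            $min = ($k === 1) ? $lo : 0x80;
            $max = ($k === 1) ? $hi : 0xBF;
            if ($c < $min || $c > $max) {
                return false;
            }
        }
        $i += $len;
    }
    return true;
}

var_dump(utf8_well_formed("\x00\xE3"));       // bool(false): tail is incomplete
var_dump(utf8_well_formed("\x00\xE3", true)); // bool(true): accepted as a stream prefix
var_dump(utf8_well_formed("\xE3\x81\x82"));   // bool(true): a complete three-byte character
```

In this sketch the two views give the same answer for "\xE3" and "\x00\xE3", so whichever behavior is chosen is at least applied consistently.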

 [2007-09-19 20:52 UTC] mike at silverorange dot com
0x00, 0xe3 is a valid byte sequence in UTF-8 but by itself is not a valid UTF-8 string (it's missing two bytes).

The function is documented as checking the validity of a string so it should return false for this case. If the function is only supposed to validate byte-streams then the documentation should be fixed.
 [2007-09-24 10:15 UTC] hirokawa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

It is a documentation problem, and it is already fixed in CVS.

 [2007-09-24 12:31 UTC] teracci2002 at yahoo dot co dot jp
I guess the problem is not only in the documentation.

var_dump(mb_check_encoding("\x00\xE3","UTF-8"));
=> bool(true)        may be checking validity of "byte streams"

var_dump(mb_check_encoding("\xE3", "UTF-8"));
=> bool(false)       may be checking validity of "string"

# I hope that this function checks validity of "string", not "byte streams" (but this is just my opinion).
 