Bug #41147 mb_check_encoding fails to check invalid string
Submitted: 2007-04-20 09:21 UTC
Modified: 2007-09-24 12:31 UTC
From: teracci2002 at yahoo dot co dot jp
Assigned: hirokawa
Status: Closed
Package: mbstring related
PHP Version: 5.2.1
OS: Linux
Private report: No
CVE-ID: None
 [2007-04-20 09:21 UTC] teracci2002 at yahoo dot co dot jp
Description:
------------
mb_check_encoding returns true when certain invalid EUC-JP / Shift_JIS / UTF-8 byte sequences are supplied.


Reproduce code:
---------------
//(1)
var_dump(mb_check_encoding("\x00\xA1", "EUC-JP"));
//(2)
var_dump(mb_check_encoding("\x00\x81", "Shift_JIS"));
//(3)
var_dump(mb_check_encoding("\x00\xE3", "UTF-8"));

Expected result:
----------------
//(1)
bool(false)
//(2)
bool(false)
//(3)
bool(false)

Actual result:
--------------
//(1)
bool(true)
//(2)
bool(true)
//(3)
bool(true)


 [2007-04-20 09:27 UTC] tony2001@php.net
Please explain why you think it should succeed if the data is invalid.
 [2007-04-20 10:11 UTC] teracci2002 at yahoo dot co dot jp
If the data is NOT valid, FALSE should be returned, I guess.
But actually it returns TRUE.
Am I wrong or missing your point?
 [2007-08-18 14:45 UTC] hirokawa@php.net
This is expected behavior because
0x00,0xa1 is a null byte followed by a valid Shift_JIS character.
mb_check_encoding should be used to detect invalid
or corrupted multibyte characters.

 [2007-08-18 16:00 UTC] teracci2002 at yahoo dot co dot jp
Please read the bug report again.

No one says 0x00,0xa1 is invalid character in ShiftJIS.
 [2007-08-19 20:10 UTC] jani@php.net
Someone disagrees, Rui.. :)
 [2007-09-04 14:30 UTC] hirokawa@php.net
> No one says 0x00,0xa1 is invalid character in ShiftJIS.
I didn't say that.

0x00+0xa1 is valid byte sequence in Shift_JIS sequence.
A character in the Shift_JIS encoding is encoded as either a single byte
or a double byte.
In this case, the byte stream is recognized as two characters:
a null byte and a half-width katakana-range punctuation character (0xa1).
 
see: http://hp.vector.co.jp/authors/VA013241/misc/shiftjis.html


 [2007-09-04 14:55 UTC] teracci2002 at yahoo dot co dot jp
> 0x00+0xa1 is valid byte sequence in Shift_JIS sequence.

I know that.
But 0x00+0x81 is an invalid sequence in Shift_JIS.
So why does the statement below return "bool(true)"?

var_dump(mb_check_encoding("\x00\x81", "Shift_JIS"));

Please read the bug report again.
 [2007-09-04 22:38 UTC] jani@php.net
Did you read it Rui? (why do your reports end up as 'Analyzed' all the time? :)
 [2007-09-16 08:56 UTC] hirokawa@php.net
Sorry for the delayed response.

0x00,0x81 is also a valid byte sequence in Shift_JIS
because 0x81 is a valid first byte of a double-byte
JIS X 0208 character.

See: http://en.wikipedia.org/wiki/Shift_jis

We cannot decide whether the byte stream is valid or
invalid because the last byte of the stream (0x81)
is a valid first byte of a double-byte character.
In this case, true (valid) is returned.

A byte stream containing a valid first byte followed by
an invalid second byte returns false.

For example,

var_dump(mb_check_encoding("\x81\x00", "Shift_JIS"));

returns false (invalid).

This is because 0x81 is a valid first byte of a double-byte
JIS X 0208 character, but 0x00 is an invalid second byte of
a double-byte JIS X 0208 character.

Likewise, 0x00, 0xe3 is a valid byte sequence in UTF-8
(a null byte + the first byte of
a three-byte UTF-8 character).

See: http://en.wikipedia.org/wiki/UTF-8
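
The three Shift_JIS cases discussed above can be told apart with a small structural scanner. This is only a sketch, not how mbstring is implemented: sjis_structure() is a hypothetical helper, the byte ranges are the commonly cited JIS X 0201 / JIS X 0208 ones, vendor extensions using lead bytes 0xF0-0xFC are ignored, and no character table is consulted, so unassigned code points still pass.

```php
<?php
// Hypothetical helper (not part of mbstring): classify a byte string by
// Shift_JIS structure only. Returns 'complete', 'truncated' or 'invalid'.
function sjis_structure(string $bytes): string
{
    $n = strlen($bytes);
    for ($i = 0; $i < $n; $i++) {
        $b = ord($bytes[$i]);
        // Assumed lead-byte ranges for double-byte JIS X 0208 characters.
        $isLead = ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xEF);
        if (!$isLead) {
            // ASCII range or half-width katakana range: one-byte character.
            if ($b <= 0x7F || ($b >= 0xA1 && $b <= 0xDF)) {
                continue;
            }
            return 'invalid'; // 0x80, 0xA0, 0xF0-0xFF in this sketch
        }
        if ($i + 1 >= $n) {
            return 'truncated'; // lead byte with nothing after it
        }
        $t = ord($bytes[++$i]);
        $isTrail = ($t >= 0x40 && $t <= 0x7E) || ($t >= 0x80 && $t <= 0xFC);
        if (!$isTrail) {
            return 'invalid'; // e.g. 0x00 after a lead byte
        }
    }
    return 'complete';
}

$a = sjis_structure("\x00\xA1"); // 'complete'  (two one-byte characters)
$b = sjis_structure("\x00\x81"); // 'truncated' (dangling lead byte)
$c = sjis_structure("\x81\x00"); // 'invalid'   (bad second byte)
```

Under this reading, the dispute in this report is whether 'truncated' should count as valid: a stream-oriented check can accept it because more bytes might follow, while a whole-string check would reject it.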

 [2007-09-19 20:52 UTC] mike at silverorange dot com
0x00, 0xe3 is a valid byte sequence in UTF-8 but by itself is not a valid UTF-8 string (it's missing two bytes).

The function is documented as checking the validity of a string so it should return false for this case. If the function is only supposed to validate byte-streams then the documentation should be fixed.
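
To make the "missing two bytes" point concrete, here is a minimal sketch that counts how many continuation bytes the last character of a UTF-8 byte string still needs. utf8_missing_tail_bytes() is a hypothetical name, and the scan is simplified: it does not reject overlong forms or surrogate code points, so it should not be used as a full validator.

```php
<?php
// Hypothetical helper: returns the number of continuation bytes missing
// from the final character, 0 if the string ends on a character boundary,
// or -1 if a malformed byte is found before the end.
function utf8_missing_tail_bytes(string $bytes): int
{
    $n = strlen($bytes);
    $i = 0;
    while ($i < $n) {
        $b = ord($bytes[$i]);
        if ($b <= 0x7F) {
            $need = 0;                      // single-byte (ASCII, incl. NUL)
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1;                      // lead byte of a 2-byte character
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;                      // lead byte of a 3-byte character
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;                      // lead byte of a 4-byte character
        } else {
            return -1;                      // stray continuation or bad lead
        }
        $i++;
        for ($k = 0; $k < $need; $k++, $i++) {
            if ($i >= $n) {
                return $need - $k;          // input ended mid-character
            }
            if ((ord($bytes[$i]) & 0xC0) !== 0x80) {
                return -1;                  // not a continuation byte
            }
        }
    }
    return 0;
}

$missing = utf8_missing_tail_bytes("\x00\xE3"); // 2
```

A result greater than zero corresponds to the case in this report: every byte seen so far is acceptable, but the string as a whole is incomplete.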
 [2007-09-24 10:15 UTC] hirokawa@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

It is a documentation problem, and it is already fixed in CVS.

 [2007-09-24 12:31 UTC] teracci2002 at yahoo dot co dot jp
I think the problem is not only in the documentation.

var_dump(mb_check_encoding("\x00\xE3","UTF-8"));
=> bool(true)        may be checking validity of "byte streams"

var_dump(mb_check_encoding("\xE3", "UTF-8"));
=> bool(false)       may be checking validity of "string"

# I hope that this function checks validity of "string", not "byte streams" (but this is just my opinion).
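
For callers who want the "string" semantics described above regardless of how mb_check_encoding behaves in a given PHP version, one possible workaround for UTF-8 is to lean on PCRE: with the /u modifier, preg_match() validates the entire subject as UTF-8 and returns false when it is malformed or truncated. The wrapper name is_well_formed_utf8() is made up for this sketch, and the approach only covers UTF-8, not EUC-JP or Shift_JIS.

```php
<?php
// Hypothetical wrapper: whole-string UTF-8 check via PCRE's /u validation.
function is_well_formed_utf8(string $s): bool
{
    // The empty pattern matches any valid subject, so the result is 1 for
    // well-formed input and false (not 0) when UTF-8 validation fails.
    return preg_match('//u', $s) === 1;
}

$truncated = is_well_formed_utf8("\x00\xE3");     // false: 0xE3 lacks 2 bytes
$loneLead  = is_well_formed_utf8("\xE3");         // false
$complete  = is_well_formed_utf8("\xE3\x81\x82"); // true
```

Both inputs from the comment above are rejected here, so the leading null byte makes no difference to the result.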
 