Bug #24309 mb_detect_encoding return EUC-JP for invalid EUC-JP char sequence
Submitted: 2003-06-24 02:52 UTC Modified: 2003-07-13 02:36 UTC
Votes:1
Avg. Score:4.0 ± 0.0
Reproduced:0 of 0 (0.0%)
From: jc at mega-bucks dot co dot jp Assigned: hirokawa (profile)
Status: Closed Package: mbstring related
PHP Version: 4.3.3RC1 OS: Linux
Private report: No CVE-ID: None
 [2003-06-24 02:52 UTC] jc at mega-bucks dot co dot jp
Description:
------------
I've just run into a strange "bug". I have a form on my web site that
takes input from the user and then uses that to do a search of a
postgresql database.

The form is set to be EUC-JP, but this weekend a user submitted a query
that postgres rejected because it "contains invalid EUC-JP" characters.
Luckily the error was logged and I was able to track it down.

I thought that maybe the user had entered some bad characters in the
form or used some strange encoding, so I decided to check that
the encoding of the submitted form data really is EUC-JP using
mb_detect_encoding(). But unfortunately mb_detect_encoding() says that
the invalid string *is* in EUC-JP!?

The query string as it appears in the URL is:

search_words=%B7%F6%BA%7E

In the script that parses this query I have put the following:

$words = $_GET["search_words"];
$enc = mb_detect_encoding($words);
echo "encoding is $enc and the query is ($words)"; die;

The result is:

encoding is EUC-JP and the query is (喧?)

As you can see the query string is *not* a valid EUC-JP sequence ...

Reproduce code:
---------------
$words = $_GET["search_words"];
$enc = mb_detect_encoding($words);
echo "encoding is $enc and the query is ($words)"; die;

Expected result:
----------------
SJIS (?) or Undefined.

The documentation for mb_detect_encoding() does not specify what it returns when it is passed an invalid character sequence whose encoding cannot be detected.

In the above case the character sequence is valid SJIS I believe ...

Actual result:
--------------
EUC-JP

 [2003-06-28 09:40 UTC] hirokawa@php.net
URL decoded byte sequence of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence.

<?php // sample code
$str_euc = sprintf("%c%c%c%c",0xb7,0xf6,0xba,0x7e);
echo mb_detect_encoding($str_euc); // output is 'EUC-JP'
?>

Encoding detection is not perfect; it may make mistakes if the string is too short.

However, I believe mbstring's encoding detection works correctly in this case.
B7E6+BA7E is not a correct byte sequence in SJIS, UTF-8, or ISO2022-JP. It is a correct EUC-JP byte sequence.




 [2003-06-30 07:49 UTC] hirokawa@php.net
This is not a bug in mbstring.
0xb7,0xf6,0xba,0x7e is a correct byte sequence of EUC-JP.

 [2003-06-30 20:01 UTC] jc at mega-bucks dot co dot jp
Are you sure? ^_^

I am not an encoding expert so if you say that it is a valid sequence I believe you but ...

I am using PostgreSQL as my database, and it says that this is not a valid EUC sequence. So either PHP is wrong or the database is wrong :)

Here is my test code:

  echo "Checking $string .......<BR>";
  $sql = "select id from products where name like '$string'";
  $conn = pg_connect("host=$IP port=5432 dbname=$DB user=postgres");
  $res  = pg_query($conn, $sql);
  $err_msg = pg_last_error($conn);
  if (preg_match("/Invalid EUC_JP character sequence found/", $err_msg)) {
    echo "NOT VALID<BR>";
  }

The error message returned by the DB is:

pg_query(): Query failed: ERROR:  Invalid EUC_JP character sequence found (0xba7e)

The output is:

Checking 喧� .......
NOT VALID

I'll also post this to the PostgreSQL developers' list in case it is a bug in PostgreSQL.

If you are certain that this character sequence is valid, can you point me to a resource I can use to show the PostgreSQL team that they have a bug that needs fixing?

Thanks!
 [2003-07-01 21:39 UTC] jc at mega-bucks dot co dot jp
hirokawa wrote:

"URL decoded byte sequence of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence."

You mean the decoded sequence is B7F6+BA7E, not B7E6+BA7E, right?

"B7E6+BA7E [...] is correct EUC-JP byte sequence."

Again do you mean that B7F6+BA7E is correct EUC-JP? I don't think it is ...

Thanks.
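
The question of B7E6 versus B7F6 can be checked by dumping the decoded bytes directly. The following is a minimal sketch that uses only urldecode() and bin2hex(); the variable names are illustrative and do not come from the original report.

```php
<?php
// Percent-decode the query value from the report and view its raw bytes as hex.
$raw = urldecode('%B7%F6%BA%7E');
$hex = strtoupper(bin2hex($raw));

// Group the hex digits into two-byte units to match the notation used in this thread.
$pairs = implode('+', str_split($hex, 4));
echo $pairs, "\n"; // prints B7F6+BA7E
```

The second byte is F6, so the sequence under discussion is B7F6+BA7E.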
 [2003-07-07 06:15 UTC] sniper@php.net
Assigned to the one person who knows what this is about.. :)

 [2003-07-13 02:36 UTC] hirokawa@php.net
This bug has been fixed in CVS.

In case this was a PHP problem, snapshots of the sources are packaged
every three hours; this change will be in the next snapshot. You can
grab the snapshot at http://snaps.php.net/.
 
In case this was a documentation problem, the fix will show up soon at
http://www.php.net/manual/.

In case this was a PHP.net website problem, the change will show
up on the PHP.net site and on the mirror sites in short time.
 
Thank you for the report, and for helping us make PHP better.

I made a mistake.
B7F6 is a valid byte sequence in EUC-JP, but BA7E is not,
so the whole string is not a valid EUC-JP byte sequence.
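
The reason BA7E fails is structural: in EUC-JP a two-byte JIS X 0208 character needs both bytes in the range 0xA1-0xFE, and 0x7E is an ASCII byte. The sketch below illustrates that rule with a hypothetical helper, eucjp_first_invalid_offset(), which is not part of mbstring. It checks byte ranges only, not whether each code point is actually assigned, and it assumes a strict reading in which C1 control bytes are rejected.

```php
<?php
/**
 * Hypothetical helper (not an mbstring function): structural check of an
 * EUC-JP byte string. Returns the offset of the first character that breaks
 * the byte-range rules, or -1 if none does.
 */
function eucjp_first_invalid_offset(string $s): int
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {                      // single-byte ASCII
            $need = 0; $lo = 0; $hi = 0;
        } elseif ($b === 0x8E) {              // SS2: one half-width katakana byte follows
            $need = 1; $lo = 0xA1; $hi = 0xDF;
        } elseif ($b === 0x8F) {              // SS3: two JIS X 0212 bytes follow
            $need = 2; $lo = 0xA1; $hi = 0xFE;
        } elseif ($b >= 0xA1 && $b <= 0xFE) { // JIS X 0208 lead byte
            $need = 1; $lo = 0xA1; $hi = 0xFE;
        } else {
            return $i;                        // assumption: other bytes cannot start a character
        }
        for ($k = 1; $k <= $need; $k++) {
            if ($i + $k >= $len) {
                return $i;                    // string ends in the middle of a character
            }
            $t = ord($s[$i + $k]);
            if ($t < $lo || $t > $hi) {
                return $i;                    // trail byte outside the allowed range
            }
        }
        $i += $need + 1;
    }
    return -1;
}

$str = "\xB7\xF6\xBA\x7E";
$offset = eucjp_first_invalid_offset($str);
printf("first invalid offset: %d (0x%s)\n", $offset, bin2hex(substr($str, $offset, 2)));
// prints: first invalid offset: 2 (0xba7e)
```

The reported offset points at the same two bytes, 0xba7e, that the PostgreSQL error message names.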

mb_detect_encoding() can choose the best candidate from the
encoding list, but it cannot detect corruption.
It assumes the bytes are not corrupted, and it stops detection as soon as only one candidate remains.

I added a 'strict detection mode' to the CVS tree to detect corrupted strings.
To use it, pass TRUE as the third argument of mb_detect_encoding().

<?php
 $str = urldecode('%B7%F6%BA%7E');
 echo mb_detect_encoding($str); // output: 'EUC-JP'
 echo mb_detect_encoding($str, "ASCII,JIS,UTF-8,EUC-JP,SJIS", TRUE); // output: NULL (strict mode)
 echo mb_detect_encoding($str, NULL, TRUE); // output: NULL (strict mode, second argument is omitted)
?>
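
When the expected encoding is already known, as with the EUC-JP form in this report, the input can also be validated against that one encoding instead of being detected. The following is a sketch only: is_valid_form_input() is a hypothetical wrapper, and mb_check_encoding() requires the mbstring extension and a PHP release newer than the 4.3.3RC1 this report was filed against.

```php
<?php
// Hypothetical guard: accept form input only if it is valid in the
// encoding the form declares (EUC-JP by default in this sketch).
function is_valid_form_input(string $value, string $encoding = 'EUC-JP'): bool
{
    return mb_check_encoding($value, $encoding);
}

$words = urldecode('%B7%F6%BA%7E');
if (is_valid_form_input($words)) {
    echo "input accepted\n";   // safe to build and run the query here
} else {
    echo "input rejected: not a valid EUC-JP byte sequence\n";
}
```

Rejecting the value before it reaches the database avoids relying on the query error to catch the bad sequence.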










 