Bug #22041 mb_substr produces "mojibake" on certain strings ...
Submitted: 2003-02-04 05:28 UTC Modified: 2003-02-05 08:27 UTC
From: jc at mega-bucks dot co dot jp Assigned:
Status: Not a bug Package: mbstring related
PHP Version: 4.3.0 OS: Red Hat Linux 7.2
Private report: No CVE-ID: None
 [2003-02-04 05:28 UTC] jc at mega-bucks dot co dot jp
First, sorry for any offensive Japanese words. I can't read/write Japanese very well, and the error in mb_substr occurs on data from a list of video titles ... I tried to find another less offensive example but couldn't. I'm just posting this bug report in order to help ...

I am trying to use mb_substr on data I get from a PostgreSQL DB and in some cases mb_substr seems to cut the string in the middle of a multibyte char .. which turns the "cut" char into mojibake ...

The DB is in EUC-JP and my internal encoding is set to EUC-JP in my php.ini file ...

As you can see the last character of the string has been improperly cut ...

Here is my test program and output:

CODE:

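<?php
// Project-specific helper that provides sql_query()
require_once("db_functions/sql_query.inc");

$sql = "select maker_comment from products where id=12802";
$res = sql_query($sql);
$dat = pg_fetch_object($res);
// Read the column that the query actually selects
$c = $dat->maker_comment;

echo "String: <BR>";
echo $c ."<BR>";

// Keep the first 80 characters (not bytes), using the internal encoding
$c = mb_substr($c, 0, 80);

echo "<BR> After cutting it ... <BR>";
echo $c ."<BR>";
?>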

OUTPUT:

COMMENT2:
??????Ρ?Ķ-?Դ֤Υ?????ץ??꡼???ģء?³???о졪?????κ??Ԥ??ΤǤϤ????ޤ?????;ʬ?ʲ褬?ʤ?

AFTER cutting it ...
??????Ρ?Ķ-?Դ֤Υ?????ץ??꡼???ģء?³???о졪?????κ??Ԥ??ΤǤϤ?????&#65533;

 [2003-02-04 07:36 UTC] moriyoshi@php.net
LOL! It's indeed so OFFENSIVE I have no idea how to translate those words to English. But perhaps you know what that means?

Ehm, first try setting the internal encoding to "eucJP-win".
 [2003-02-04 21:20 UTC] jc at mega-bucks dot co dot jp
Glad you could see the funny side of this bug report :) I did try very hard to find a better example ... but couldn't get mb_substr to break on anything else.

Why set internal encoding to eucJP-win? The data is from a database and is in EUC-JP ...

When I entered the data into the DB, if there were any illegal EUC-JP characters it should have complained ...

And as you can see I can display the whole string as EUC-JP perfectly. It's only *after* I use mb_substr() that the string becomes mojibake ...

Thanks!
 [2003-02-05 07:47 UTC] moriyoshi@php.net
Since mb_substr() internally converts input strings to a Unicode representation, if it finds an "illegal" character that is not a member of the input character set, it simply ends up returning wrong results. eucJP-win is provided for convenience so that users can handle strings whose characters come from the CP932 character set but are encoded in EUC-JP.

Practically, there are some EUC-JP variants because EUC-JP itself originally represents just an encoding rather than a whole character set. 

I think this practice is quite confusing too, but please keep in mind that an encoding doesn't always have a single corresponding character set, even when the two share a name. In this context, EUC-JP is really the name of an encoding that is often mistaken for a character set name; the character sets that EUC-JP _can_ represent are ISO646, JISX0201-1976, JISX0208-1990, JISX0212-1990, JISX0213-2000, and so on.
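To make the character-versus-byte distinction concrete, here is a minimal sketch (not taken from the report) that assumes the mbstring extension is loaded. The sample string is "日本語" written as raw EUC-JP bytes, and the encoding is passed explicitly instead of relying on the internal encoding set in php.ini:

```php
<?php
// "日本語" as raw EUC-JP bytes, so the example does not depend on the
// encoding of this source file.
$s = "\xC6\xFC\xCB\xDC\xB8\xEC";

// Byte-oriented substr(): 3 bytes is one and a half characters, so the
// result ends in a dangling lead byte (this is what mojibake looks like).
$bytes = substr($s, 0, 3);

// mb_substr() counts characters. Naming the encoding explicitly avoids
// depending on the mbstring internal encoding setting.
$chars = mb_substr($s, 0, 2, 'EUC-JP');

echo bin2hex($bytes), "\n";          // c6fccb
echo bin2hex($chars), "\n";          // c6fccbdc
echo mb_strlen($s, 'EUC-JP'), "\n";  // 3
```

If the input contains byte sequences that the named encoding does not define, the character count itself becomes unreliable, which is why the choice between "EUC-JP" and "eucJP-win" matters here.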

Anyway, did you try it out?

 [2003-02-05 07:55 UTC] jc at mega-bucks dot co dot jp
Wow, thanks for the long answer! I didn't realize that EUC-JP was not a single character set ...

I tried what you suggested and that fixed the problem ...

But now makes me wonder what character set my data is in?? And I set my Postgresql database to be EUC-JP, but since you say that could mean more than one thing, I wonder which one PostgreSQL uses??

Since I am so confused as to what format my data is in, I ended up using the database's substr() function instead of PHP's ... I figure that is safer ...

So I guess there is no problem with mb_substr() then ... just that even though the DB says the data is in EUC-JP it really is in eucJP-win?
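One way to probe that question, sketched here under assumptions: it needs a newer PHP than the 4.3.0 in this report (mb_check_encoding() and type declarations are later additions), and encodings_accepting() is a hypothetical helper, not part of mbstring. It simply asks mbstring which candidate encodings accept a byte string as well-formed:

```php
<?php
// Hypothetical helper (not part of mbstring): return the candidate
// encodings under which $bytes is considered well-formed.
function encodings_accepting(string $bytes, array $candidates): array
{
    $accepted = [];
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            $accepted[] = $encoding;
        }
    }
    return $accepted;
}

// Stand-in data: "日本語" as EUC-JP bytes. In practice this would be the
// column value fetched from the database.
$sample = "\xC6\xFC\xCB\xDC\xB8\xEC";

print_r(encodings_accepting($sample, ['EUC-JP', 'eucJP-win']));
```

A string that only "eucJP-win" accepts would suggest vendor-extension characters in the data. Passing the check is not proof of the true encoding, since many byte strings are well-formed in both.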

Thanks!

PS You can close the report if you agree that there is no error in mb_substr()

PPS I love PHP's mb functions, thanks for your work. I just wish the world would agree on ONE Japanese encoding =) It would save me a lot of headaches ...
 [2003-02-05 08:27 UTC] moriyoshi@php.net
First, let me mark this report as bogus, because it doesn't appear to be a bug after all.

> But now makes me wonder what character set my data is in?? And I set my

It's up to the browser you are using how posted form data is encoded and submitted to a PHP script.

> Postgresql database to be EUC-JP, but since you say that could mean more
> than one thing, I wonder which one PostgreSQL uses??

Inside PostgreSQL, "EUC-JP" encoded characters are handled in a character-set-independent way, whereas in mbstring "EUC-JP" is treated as the name of a specific charset-to-encoding mapping.



 