PHP :: Bug #22041 :: mb_substr produces "mojibake" on certain strings ...

Bug #22041	mb_substr produces "mojibake" on certain strings ...
Submitted:	2003-02-04 05:28 UTC	Modified:	2003-02-05 08:27 UTC
From:	jc at mega-bucks dot co dot jp	Assigned:
Status:	Not a bug	Package:	mbstring related
PHP Version:	4.3.0	OS:	Red Hat Linux 7.2
Private report:	No	CVE-ID:	None


[2003-02-04 05:28 UTC] jc at mega-bucks dot co dot jp

First, sorry for any offensive Japanese words. I can't read or write Japanese very well, and the error in mb_substr occurs on data from a list of video titles ... I tried to find another, less offensive example but couldn't. I'm just posting this bug report in order to help ...

I am trying to use mb_substr on data I get from a PostgreSQL DB, and in some cases mb_substr seems to cut the string in the middle of a multibyte char ... which turns the "cut" char into mojibake ...

The DB is in EUC-JP and my internal encoding is set to EUC-JP in my php.ini file ...

As you can see the last character of the string has been improperly cut ...

Here is my test program and output:

CODE:

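<?php
// Project-specific helper that provides sql_query().
require_once("db_functions/sql_query.inc");

$sql = "select maker_comment from products where id=12802";
$res = sql_query($sql);
$dat = pg_fetch_object($res);
// Read the column that the query actually selects.
$c = $dat->maker_comment;

echo "String: <BR>";
echo $c ."<BR>";

// Keep the first 80 characters, using the internal encoding from php.ini.
$c = mb_substr($c, 0, 80);

echo "<BR> After cutting it ... <BR>";
echo $c ."<BR>";
?>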

OUTPUT:

COMMENT2:
??????Ρ?Ķ-?Դ֤Υ?????ץ??꡼???ģء?³???о졪?????κ??Ԥ??ΤǤϤ????ޤ?????;ʬ?ʲ褬?ʤ?

AFTER cutting it ...
??????Ρ?Ķ-?Դ֤Υ?????ץ??꡼???ģء?³???о졪?????κ??Ԥ??ΤǤϤ?????&#65533;


[2003-02-04 07:36 UTC] moriyoshi@php.net

LOL! It's indeed so OFFENSIVE I have no idea how to translate those words to English. But perhaps you know what that means?

Ehm, first try setting the internal encoding to "eucJP-win".

[2003-02-04 21:20 UTC] jc at mega-bucks dot co dot jp

Glad you could see the funny side of this bug report :) I did try very hard to find a better example ... but couldn't get mb_substr to break on anything else.

Why set internal encoding to eucJP-win? The data is from a database and is in EUC-JP ...

When I entered the data into the DB, if there were any illegal EUC-JP characters it should have complained ...

And as you can see I can display the whole string as EUC-JP perfectly. It's only *after* I use mb_substr() that the string becomes mojibake ...

Thanks!
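
One way to check the assumption in the comment above, that the stored bytes are valid in the declared encoding, is mb_check_encoding(). The following is a minimal sketch using a made-up sample string (the data from the report is not reproduced here). It also contrasts a byte-based cut with a character-based cut. Whether the string from the report would pass this check under "EUC-JP" is not known.

```php
<?php
// Hypothetical sample text, converted to EUC-JP for the demonstration.
$euc = mb_convert_encoding("日本語のテキスト", 'EUC-JP', 'UTF-8');

// Each of these characters occupies two bytes in EUC-JP.
$byteCut = substr($euc, 0, 3);              // byte-based: stops mid-character
$charCut = mb_substr($euc, 0, 3, 'EUC-JP'); // character-based: three characters

// Validate each string against the encoding it is claimed to be in.
var_dump(mb_check_encoding($euc, 'EUC-JP'));
var_dump(mb_check_encoding($byteCut, 'EUC-JP'));
var_dump(mb_check_encoding($charCut, 'EUC-JP'));
```

If the check fails for data that displays correctly, the bytes probably belong to a different variant than the one named.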

[2003-02-05 07:47 UTC] moriyoshi@php.net

Since mb_substr() internally converts input strings to the Unicode character set representation, if it finds an "illegal" character that is not a member of the input character set, it simply ends up returning wrong results. eucJP-win is provided for convenience so that users can handle strings whose characters come from the CP932 character set and are encoded in EUC-JP.

Practically, there are some EUC-JP variants because EUC-JP itself originally represents just an encoding rather than a whole character set. 

I think this practice is quite confusing too, but please keep in mind that an encoding doesn't always have a single corresponding character set, even when the two share a name. In this context, EUC-JP is really the name of an encoding that is often mistaken for a character set name; the character sets that EUC-JP _can_ represent include ISO646, JISX0201-1976, JISX0208-1990, JISX0212-1990, JISX0213-2000, and so on.
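
To try the suggestion without editing php.ini, the encoding can be passed to mb_substr() as its fourth argument. This is a minimal sketch with a made-up sample string; "①" is used because it is one of the CP932 vendor-extension characters. It only illustrates the call and is not a reproduction of the reported failure.

```php
<?php
// Hypothetical sample text; "①" comes from the CP932 vendor extensions.
$utf8 = "日本語①テキスト";

// Encode as eucJP-win, the variant suggested in this thread.
$euc = mb_convert_encoding($utf8, 'eucJP-win', 'UTF-8');

// Name the encoding explicitly instead of relying on mbstring.internal_encoding.
$head = mb_substr($euc, 0, 4, 'eucJP-win');

// Convert back to UTF-8 only for display.
$headUtf8 = mb_convert_encoding($head, 'UTF-8', 'eucJP-win');
echo $headUtf8, "\n";
```

Passing the encoding per call keeps the choice visible next to the data it applies to, which helps when different sources use different EUC-JP variants.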

Anyway, did you try it out?

[2003-02-05 07:55 UTC] jc at mega-bucks dot co dot jp

Wow, thanks for the long answer! I didn't realize that EUC-JP was not a single character set ...

I tried what you suggested and that fixed the problem ...

But now makes me wonder what character set my data is in?? And I set my Postgresql database to be EUC-JP, but since you say that could mean more than one thing, I wonder which one PostgreSQL uses??

Since I am so confused as to what format my data is in, I ended up using the database's substr() function instead of PHP's ... I figure that is safer ...

So I guess there is no problem with mb_substr() then ... just that even though the DB says the data is in EUC-JP it really is in eucJP-win?

Thanks!

PS You can close the report if you agree that there is no error in mb_substr()

PPS I love PHP's mb functions, thanks for your work. I just wish the world would agree on ONE Japanese encoding =) It would save me a lot of headaches ...

[2003-02-05 08:27 UTC] moriyoshi@php.net

First, let me mark this report as bogus, because it doesn't appear to be a bug after all.

> But now makes me wonder what character set my data is in?? And I set my

How posted form data are encoded and submitted to a PHP script is up to the browser you are using.

> Postgresql database to be EUC-JP, but since you say that could mean more
> than one thing, I wonder which one PostgreSQL uses??

Inside PostgreSQL, "EUC-JP" encoded characters are handled in a character-set-independent way, whilst in mbstring the name is treated as a charset-to-encoding mapping.



Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Fri Jun 26 22:00:02 2026 UTC