Bug #22041 mb_substr produces "mojibake" on certain strings ...
Submitted: 2003-02-04 05:28 UTC Modified: 2003-02-05 08:27 UTC
From: jc at mega-bucks dot co dot jp Assigned:
Status: Not a bug Package: mbstring related
PHP Version: 4.3.0 OS: Red Hat Linux 7.2
Private report: No CVE-ID: None

 [2003-02-04 05:28 UTC] jc at mega-bucks dot co dot jp
First, sorry for any offensive Japanese words. I can't read/write Japanese very well, and the error in mb_substr occurs on data from a list of video titles ... I tried to find another, less offensive example but couldn't. I'm just posting this bug report in order to help ...

I am trying to use mb_substr on data I get from a PostgreSQL DB, and in some cases mb_substr seems to cut the string in the middle of a multibyte char ... which turns the "cut" char into mojibake ...

The DB is in EUC-JP and my internal encoding is set to EUC-JP in my php.ini file ...

As you can see the last character of the string has been improperly cut ...

Here is my test program and output:

CODE:

<?php
require_once("db_functions/sql_query.inc");

$sql = "select maker_comment from products where id=12802";
$res = sql_query($sql);
$dat = pg_fetch_object($res);
$c = $dat->maker_comment; // the column selected in $sql

echo "String: <BR>";
echo $c ."<BR>";

$c = mb_substr($c, 0, 80);

echo "<BR> After cutting it ... <BR>";
echo $c ."<BR>";
?>

OUTPUT:

COMMENT2:
??????Ρ?Ķ-?Դ֤Υ?????ץ??꡼???ģء?³???о졪?????κ??Ԥ??ΤǤϤ????ޤ?????;ʬ?ʲ褬?ʤ?

AFTER cutting it ...
??????Ρ?Ķ-?Դ֤Υ?????ץ??꡼???ģء?³???о졪?????κ??Ԥ??ΤǤϤ?????&#65533;

 [2003-02-04 07:36 UTC] moriyoshi@php.net
LOL! It's indeed so OFFENSIVE I have no idea how to translate those words to English. But perhaps you know what that means?

Ehm, first try setting the internal encoding to "eucJP-win".
 [2003-02-04 21:20 UTC] jc at mega-bucks dot co dot jp
Glad you could see the funny side of this bug report :) I did try very hard to find a better example ... but couldn't get mb_substr to break on anything else.

Why set internal encoding to eucJP-win? The data is from a database and is in EUC-JP ...

When I entered the data into the DB, if there were any illegal EUC-JP characters it should have complained ...

And as you can see I can display the whole string as EUC-JP perfectly. It's only *after* I use mb_substr() that the string becomes mojibake ...

Thanks!
 [2003-02-05 07:47 UTC] moriyoshi@php.net
Since mb_substr() internally converts input strings to the Unicode character set representation, if it finds an "illegal" character that is not a member of the input character set, it simply ends up returning wrong results. eucJP-win is provided for convenience so that users can handle strings whose characters come from the CP932 character set but are encoded in EUC-JP.
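
As a minimal sketch of the point above, the encoding label can also be passed to mb_substr() explicitly instead of relying on the internal encoding from php.ini. The helper name cut_title() and the sample bytes are assumptions made for illustration; they are not taken from the report. The sample contains only plain JIS X 0208 kanji, so both labels cut it the same way. Per the explanation above, the two labels would be expected to differ only on characters that come from CP932.

```php
<?php
// Hypothetical helper: cut to $chars characters under an explicitly named encoding.
function cut_title(string $bytes, int $chars, string $encoding): string
{
    // The 4th argument overrides mbstring's internal encoding for this call.
    return mb_substr($bytes, 0, $chars, $encoding);
}

// Assumed sample data: three two-byte JIS X 0208 kanji as EUC-JP bytes.
$title = "\xC6\xFC\xCB\xDC\xB8\xEC";

foreach (['EUC-JP', 'eucJP-win'] as $encoding) {
    $cut = cut_title($title, 2, $encoding);
    // Print the label and the resulting bytes in hex.
    printf("%-10s %s\n", $encoding, bin2hex($cut));
}
```

Printing the result with bin2hex() makes it easy to see whether a cut landed on a character boundary, independent of how the terminal or browser renders the bytes.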

In practice, there are several EUC-JP variants, because EUC-JP itself originally denotes just an encoding rather than a whole character set.

I think this practice is quite confusing too, but please keep in mind that an encoding doesn't always have a single corresponding character set even though their names are the same. In this context, it could be said that EUC-JP is the name of an encoding that is often mistaken for a character set name, whereas the actual names of the character sets that EUC-JP _can_ represent are ISO646, JISX0201-1976, JISX0208-1990, JISX0212-1990, JISX0213-2000, and so on.

Anyway, did you try it out?

 [2003-02-05 07:55 UTC] jc at mega-bucks dot co dot jp
Wow, thanks for the long answer! I didn't realize that EUC-JP was not a single character set ...

I tried what you suggested and that fixed the problem ...

But now it makes me wonder what character set my data is in?? And I set my PostgreSQL database to be EUC-JP, but since you say that could mean more than one thing, I wonder which one PostgreSQL uses??
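
One rough way to probe that question is to ask mbstring which candidate labels accept the bytes as well-formed. This is only a sketch: matching_encodings() is a hypothetical helper, the sample bytes are made up, and mb_check_encoding() may not exist in PHP builds as old as the one in this report. Text that uses only characters common to both definitions will pass under both labels, so a match narrows things down but does not prove which character set the data was written in.

```php
<?php
// Hypothetical helper: list which candidate labels accept the bytes as well-formed.
function matching_encodings(string $bytes, array $candidates): array
{
    $ok = [];
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            $ok[] = $encoding;
        }
    }
    return $ok;
}

$candidates = ['EUC-JP', 'eucJP-win'];

// Assumed sample data, not taken from the report.
$whole  = "\xC6\xFC\xCB\xDC\xB8\xEC"; // three complete two-byte characters
$broken = "\xC6\xFC\xCB";             // second character is missing its trail byte

print_r(matching_encodings($whole, $candidates));
print_r(matching_encodings($broken, $candidates));
```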

Since I am so confused as to what format my data is in, I ended up using the database's substr() function instead of PHP's ... I figure that is safer ...
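
A sketch of what that could look like, reusing the table and column from my test program. build_truncated_query() is a made-up helper name, and the resulting string would still have to be run through whatever query function you already use.

```php
<?php
// Hypothetical helper: build a query that truncates the comment inside PostgreSQL.
function build_truncated_query(int $id, int $chars): string
{
    // %d keeps both values numeric; PostgreSQL's substr(string, from, count)
    // counts from position 1.
    return sprintf(
        'select substr(maker_comment, 1, %d) as maker_comment from products where id=%d',
        $chars,
        $id
    );
}

$sql = build_truncated_query(12802, 80);
echo $sql, "\n";
```

This way the cut is made by the database rather than by mbstring.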

So I guess there is no problem with mb_substr() then ... just that even though the DB says the data is in EUC-JP, it really is in eucJP-win?

Thanks!

PS You can close the report if you agree that there is no error in mb_substr()

PPS I love PHP's mb functions, thanks for your work. I just wish the world would agree on ONE Japanese encoding =) It would save me a lot of headaches ...
 [2003-02-05 08:27 UTC] moriyoshi@php.net
First, let me mark this report as bogus, because it doesn't appear to be a bug after all.

> But now it makes me wonder what character set my data is in?? And I set my

How posted form data are encoded and submitted to a PHP script is up to the browser you are using.

> PostgreSQL database to be EUC-JP, but since you say that could mean more
> than one thing, I wonder which one PostgreSQL uses??

Inside PostgreSQL, "EUC-JP" encoded characters are handled in a character-set independent way, whilst in mbstring the name is treated as a mapping between a character set and an encoding.



 