Bug #31805 first kanji count 6 characters more than it should
Submitted: 2005-02-02 07:21 UTC Modified: 2005-02-11 05:05 UTC
From: gullevek at gullevek dot org Assigned:
Status: Not a bug Package: mbstring related
PHP Version: 4.3.10 OS: gnu/linux
Private report: No CVE-ID: None
 [2005-02-02 07:21 UTC] gullevek at gullevek dot org
Description:
------------
When getting the length of a string containing Japanese kanji, 4.3.10 counts the first kanji as 8 characters instead of 2. Every subsequent double-byte character is counted as 2 bytes.

The problem is that mb_strlen should return only 1, not 2. If I count with strlen, the result should be 2.

I get the wrong result every way I try: with no default charset set, with a default charset set, when passing the charset to mb_strlen, and when getting it via mb_detect_encoding. It always returns the wrong length.

This did not happen in versions before 4.3.10.


 [2005-02-02 08:40 UTC] derick@php.net
Thank you for this bug report. To properly diagnose the problem, we
need a short but complete example script to be able to reproduce
this bug ourselves. 

A proper reproducing script starts with <?php and ends with ?>,
is max. 10-20 lines long and does not require any external 
resources such as databases, etc.

If possible, make the script source available online and provide
a URL to it here. Try to avoid embedding huge scripts into the report.
 [2005-02-02 08:41 UTC] derick@php.net
Please post the script somewhere online and provide a link (otherwise your Kanji might screw up in our form).
 [2005-02-03 02:20 UTC] gullevek at gullevek dot org
Okay, perhaps it is not 100% a bug. The problem is that if you have iso-2022-jp encoded data and no default encoding set, PHP doesn't read it correctly (because iso-2022-jp is encoded very differently).
See the example below. Enter two characters, one single-byte (e.g. a) and one double-byte (e.g. あ). Then you will see that in the output without iso set, the length is wrong. But I don't know why 4.3.10 behaves differently from 4.3.9 ...

<?php
// PHP 4-era helper: imports the POST fields ($string, $send) as global variables.
import_request_variables("p");
if ($send)
{
	// Show the submitted string, the detected encoding, and the three lengths.
	echo "S: $string<br>";
	echo "D: ".mb_detect_encoding($string,"iso-2022-jp")."<br>";
	echo strlen($string)." -- without iso: ".mb_strlen($string)." -- with iso: ".mb_strlen($string,"iso-2022-jp")."<br>";
}
?>
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-2022-JP">
</head>
<body>
<form method="post" name="foo" enctype="multipart/form-data">
<input type="text" name="string" size="50" value="<?php echo $string; ?>"><br>
<input type="submit" name="send" value="Send">
</form></body></html>
 [2005-02-03 02:25 UTC] gullevek at gullevek dot org
One more comment.
The problem actually occurred because mb_detect_encoding detects utf-8, even if the string is iso-2022-jp.
 [2005-02-03 07:45 UTC] moriyoshi@php.net
You look somewhat confused. First off, ISO-2022-JP is a 
"stateful" multibyte encoding and quite different from 
stateless multibyte encodings such as Shift_JIS, 
EUC-JP and UTF-8.

What makes it "stateful" are escape sequences used to 
determine in which way consecutive octets following such 
an escape sequence are interpreted by the 
implementation.

With ISO-2022-JP, a single hiragana character most 
likely ends up as 8 bytes in a stream: a 3-byte escape 
sequence (ESC $ B) is prepended to switch the 
interpretation mode to "JIS-kanji", and another 3-byte 
escape sequence (ESC ( B) is appended to switch back to 
"ASCII". Two hiragana characters would result in 10 
bytes, because those escape sequences are only needed 
when entering and leaving a chunk of multibyte 
"JIS-kanji" characters.
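The byte counts described above can be checked with a short script. This is only a sketch: it assumes a PHP build with the mbstring extension enabled, and the variable names are made up for illustration. The hiragana character is written as raw UTF-8 bytes so the result does not depend on the encoding of the script file.

```php
<?php
// "あ" (U+3042) as UTF-8 bytes, so the source file encoding does not matter.
$hiragana = "\xE3\x81\x82";

// Convert one and two characters to ISO-2022-JP. The converter adds an
// escape sequence before the multibyte run and another one after it.
$one = mb_convert_encoding($hiragana, 'ISO-2022-JP', 'UTF-8');
$two = mb_convert_encoding($hiragana . $hiragana, 'ISO-2022-JP', 'UTF-8');

echo "bytes, one character:  " . strlen($one) . "\n";  // 3 + 2 + 3 = 8
echo "bytes, two characters: " . strlen($two) . "\n";  // 3 + 4 + 3 = 10

// Told the correct encoding, mb_strlen() skips the escape sequences.
echo "mb_strlen as ISO-2022-JP: " . mb_strlen($one, 'ISO-2022-JP') . "\n";  // 1

// Told the wrong encoding, every 7-bit byte counts as one character.
echo "mb_strlen as UTF-8:       " . mb_strlen($one, 'UTF-8') . "\n";        // 8
```

The last line is one plausible explanation for the "8 characters" in the original report: when the string is treated as anything other than ISO-2022-JP, the escape bytes are counted as ordinary characters.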

If the problem you are experiencing is actually caused 
by wrong encoding detection, then setting 
mbstring.language to "Japanese" may fix it.

Encoding detection is based on heuristics, so its 
behaviour may vary between releases.
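To illustrate why detection is ambiguous here, the following sketch (again assuming the mbstring extension, with illustrative variable names) shows that an ISO-2022-JP string consists only of 7-bit bytes and is therefore also valid UTF-8. The mb_detect_encoding() results are printed but deliberately not annotated, since they depend on the candidate list and the PHP release.

```php
<?php
// "a" followed by "あ" (U+3042), converted from UTF-8 to ISO-2022-JP.
$iso = mb_convert_encoding("a\xE3\x81\x82", 'ISO-2022-JP', 'UTF-8');

// Every byte is below 0x80, so the same bytes also pass as UTF-8.
var_dump(mb_check_encoding($iso, 'UTF-8'));  // bool(true)

// Detection results may differ between releases and candidate orders,
// so no expected output is given for these two calls.
var_dump(mb_detect_encoding($iso, ['ISO-2022-JP', 'UTF-8'], true));
var_dump(mb_detect_encoding($iso, ['UTF-8', 'ISO-2022-JP'], true));

// When the encoding of the input is known (for example from the form's
// charset), passing it explicitly avoids relying on detection at all.
$length = mb_strlen($iso, 'ISO-2022-JP');
echo $length . "\n";  // 2
```

Passing the encoding explicitly is a different approach from the mbstring.language setting suggested above; it only helps when the input encoding is known in advance.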


 [2005-02-11 02:39 UTC] gullevek at gullevek dot org
Okay, I will play around with the mbstring.language settings. But I think this bug can be closed as a false alarm on my part. I am sorry.
 [2005-02-11 05:05 UTC] sniper@php.net
Bogused by user request.
