PHP :: Bug #63732 :: unicode strings not handled correctly

Bug #63732	unicode strings not handled correctly
Submitted:	2012-12-09 07:38 UTC	Modified:	2012-12-12 04:49 UTC
From:	jmichae3 at yahoo dot com	Assigned:
Status:	Not a bug	Package:	Scripting Engine problem
PHP Version:	5.3.19	OS:	linux
Private report:	No	CVE-ID:	None


[2012-12-09 07:38 UTC] jmichae3 at yahoo dot com

Description:
------------
I am getting Russian characters in my email forms. I want to compare the characters to see if they are > '~', which is the last visible character in the ASCII character set.
this comparison does not work. in Unicode, these characters are around 1024, and ~ is 126 according to ord().

ord() thinks EVERY character is ascii. this is far from true.  there are mb characters from utf8.

this is russian random characters from charmap: ЋϊЁγϋГИБЫЫЏАДрмдп

in fact, I don't have any working way to detect whether a character is KOI8-R or ASCII, or cyrillic, or whether the character ordinal number is actually beyond 127 or not. because according to ord(), it's all within 0-255.





Test script:
---------------
<?php
$s="п"; //russian character
echo substr_compare($s,"~",0,1);
    echo "\n";
$i=0;
for ($i=0; $i < strlen($s); $i++) {
	if (substr_compare($s[$i],"~",0,1) > 0) {
		echo "OK";
	} else {
		echo "fail";
	}
	if (ord($s[$i]) > 126) {
		echo "OK";
	} else {
		echo "fail";
	}
	if ($s[$i] > '~') {
		echo "OK";
	} else {
		echo "fail";
	}
	echo ord($s[$i]);
}
echo "\n";
$i=0;
/*
strangely enough, I get 2 outputs with only 1 character.
Sat 12/08/2012 23:12:46.76||E:\www\jimm|>php t.php
1
OKOKOK208OKOKOK191

Sat 12/08/2012 23:14:27.34||E:\www\jimm|>
*/
?>


Expected result:
----------------
whole characters as a single unit. 1 result.

Actual result:
--------------
got 2 results from 1 UNICODE russian character in a string. should only get 1. 
this file was encoded with utf8 without bom.
php is splitting the utf8 characters into a byte stream when it gets to strlen(). or it just treats unicode and utf8 characters like ascii.
this does not work well when trying to use mb_detect_encoding() - that breaks ability to detect encodings when it breaks up characters like that. nearly everything with strings actually.
this also breaks ability to detect foreign spam.


[2012-12-10 02:24 UTC] aharvey@php.net

-Status: Open +Status: Not a bug

[2012-12-10 02:24 UTC] aharvey@php.net

PHP strings are effectively byte arrays, and ord() only looks at the first byte. This is documented behaviour.
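
To illustrate the byte-array behaviour described above, here is a minimal sketch. It assumes the reporter's test character "п" (U+043F) is stored as UTF-8, and writes its two bytes explicitly so the result does not depend on how the file is saved.

```php
<?php
// "п" (U+043F) written as explicit UTF-8 bytes.
$s = "\xD0\xBF";

$length = strlen($s);   // counts bytes, not characters
$first  = ord($s[0]);   // value of the first byte only
$second = ord($s[1]);   // value of the second byte

echo $length, "\n";              // 2
echo $first, " ", $second, "\n"; // 208 191
```

These are the same two values (208 and 191) that appear in the reporter's output, one per loop iteration.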

[2012-12-11 17:22 UTC] jmichae3 at yahoo dot com

it may be documented behavior, but it still doesn't provide a solution to the problem.

[2012-12-11 22:22 UTC] rasmus@php.net

This is a bug reporting system. You reported a bug on a function that is behaving 
as intended and as documented. This is not a support forum. There are plenty of 
ways to do what you need. Start by reading about the mbstring functions.
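
One possible way to get a code point using only mbstring functions is sketched below. It assumes the mbstring extension is loaded and that the input is a non-empty, valid UTF-8 string; the helper name codepoint_via_mbstring is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP): code point of the first character
// of a non-empty string that is assumed to be valid UTF-8.
function codepoint_via_mbstring($str) {
    // Take one whole character, however many bytes it occupies.
    $char = mb_substr($str, 0, 1, 'UTF-8');
    // UCS-4BE stores each character as a 4-byte big-endian integer.
    $ucs4 = mb_convert_encoding($char, 'UCS-4BE', 'UTF-8');
    // 'N' unpacks an unsigned 32-bit big-endian integer; result keys start at 1.
    $parts = unpack('N', $ucs4);
    return $parts[1];
}

echo codepoint_via_mbstring("\xD0\xBF"), "\n"; // 1087 (U+043F)
echo codepoint_via_mbstring("~"), "\n";        // 126
```

On PHP 7.2 and later, the built-in mb_ord() covers the same need directly; it did not exist in the 5.3 series this report was filed against.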

[2012-12-12 00:34 UTC] jmichae3 at yahoo dot com

if you were to take the time to do the research, there is no function in PHP except ord() for converting a character [from a string] to a number. maybe strings need to be handled differently internally in php to handle UNICODE. or maybe ord simply needs to be rewritten so it works no matter what character encoding is thrown at it. it would be difficult, but extremely useful, since it is the only function. I took the time to look through the mb functions. there was nothing to help me. 

I tried looking through the mb functions, there wasn't a compare. there wasn't a way to compare. I consider a function like that to be crucial if relops are not safe or capable of doing it. if that is the case, please make one, and an mb function for returning the ordinal value of an mb char. the functionality is just not there. thanks. much appreciated.

unicode/mb-related bug database stuff:
https://bugs.php.net/bug.php?id=49439
https://bugs.php.net/bug.php?id=63732
just search the database for anything with mb_encode or unicode. there are a number of bugs related to this problem.

[2012-12-12 02:38 UTC] rasmus@php.net

Personally I'd just convert from utf8 to iso-8859-1 or whichever encoding you 
are looking for here instead of checking each character. But if you really do 
want to do it, it isn't very hard. You just need to understand what UTF-8 looks 
like and it becomes a simple 5-line function in userspace:

function utf8_ord($c) {
    $b0 = ord($c[0]);
    if($b0 < 0x80) return $b0; // single-byte (ASCII) range is 0x00-0x7F
    $b1 = ord($c[1]);
    if($b0 < 0xE0 )return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + (ord($c[2]) & 0x3F);
}

But you have to understand that there is absolutely no way to accurately detect 
the encoding of a short sequence of bytes. The above will work if you know the 
input is UTF-8. There is no way to write a magic function which will tell you 
the encoding from a couple of bytes of data which you seem to imply we should 
provide you.
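
To make the bit arithmetic in the two-byte branch of utf8_ord() concrete, here is a worked example for the reporter's character "п", whose UTF-8 bytes are 0xD0 0xBF. It is only an illustration of the masks and shifts, not a general decoder.

```php
<?php
// Worked example: decode the two UTF-8 bytes of "п" by hand.
$b0 = 0xD0; // 1101 0000: lead byte of a 2-byte sequence (110xxxxx)
$b1 = 0xBF; // 1011 1111: continuation byte (10xxxxxx)

$high = ($b0 & 0x1F) << 6; // keep the low 5 bits, shift left by 6: 0x10 << 6 = 0x400
$low  = $b1 & 0x3F;        // keep the low 6 bits: 0x3F
$codepoint = $high + $low; // 0x400 + 0x3F = 0x43F

echo $codepoint, "\n"; // 1087
```

The result, 1087 (U+043F), is consistent with the "about 1024" the reporter expected for Cyrillic characters.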

[2012-12-12 04:36 UTC] jmichae3 at yahoo dot com

this code might be more useful, I am going to give it to you.
I know there is UTF-16 and UTF-32 and such.
if the string can handle stuff like that, there really should be an internal function for that which also handles this internally.  because although this is useful and I can use it, it is a workaround rather than a real and complete solution for multiple encodings such as you would find listed with mb_list_encodings().

//returns ordinal value of character in string $str at $index 
//and increments $index past current utf-8 character.
function utf8_ord_next_char($str, &$index) { 
    $b0 = ord($str[$index + 0]);
    if ($b0 < 0x80) { // single-byte (ASCII) range is 0x00-0x7F
		$index++;
		return $b0;
    }
    $b1 = ord($str[$index + 1]);
    if ($b0 < 0xE0) {
		$index += 2;
		return (($b0 & 0x1F) << 6) + ($b1 & 0x3F);
    }
    $b2 = ord($str[$index + 2]); // read the third byte before advancing $index
    $index += 3;
    return (($b0 & 0x0F) << 12) + (($b1 & 0x3F) << 6) + ($b2 & 0x3F);
}


so for detecting non-ascii languages,
	//detect foreign languages
	for ($i=0;$i < strlen($comment);) {
		if (utf8_ord_next_char($comment,$i) > 126) {
			echo "<div style='color:red;'>ERRORb</div>";
			return true; //error
		}
	}
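
If the only goal is to flag input containing anything above '~', decoding whole characters may not be necessary. In UTF-8 every non-ASCII character is encoded entirely with bytes of 0x80 and above, so a byte-level test gives the same answer as the loop above for valid UTF-8 input. A minimal sketch follows; the helper name contains_non_ascii is hypothetical.

```php
<?php
// Hypothetical helper: true if $str contains any byte above '~' (0x7E).
function contains_non_ascii($str) {
    // No /u modifier, so PCRE compares byte by byte.
    return preg_match('/[^\x00-\x7E]/', $str) === 1;
}

var_export(contains_non_ascii("plain ~ text")); echo "\n"; // false
var_export(contains_non_ascii("\xD0\xBF"));     echo "\n"; // true
```

Like the `> 126` comparison in the loop, this also flags the DEL control character (0x7F).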

[2012-12-12 04:49 UTC] rasmus@php.net

You have just slightly reformatted the code I gave you.

[2012-12-12 06:34 UTC] jmichae3 at yahoo dot com

something is wrong with the code you gave me. it doesn't work.
also, it doesn't follow Ken Thompson's table shown below in URL in comment.
there are many other reference documents (Wikipedia).

one thing I discovered about this code is that: 
- the page must be encoded as UTF-8 without BOM (such as using notepad++, Encoding, Convert to utf8 without BOM).
- you must also include a meta tag 
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />


better coded utf-8 only version and one that actually works:
(and by the way, this works with ascii too)

//returns ordinal value of character in string $str at $index 
//and increments $index past current utf-8 character.
//based on the table at http://doc.cat-v.org/bell_labs/utf-8_history
function utf8_ord_next_char($str, &$index) { 
	$b0 = ord($str[$index]);
	//the lead byte pattern gives the sequence length and the payload bits.
	//== binds tighter than & in php, so each mask test needs parentheses.
	if (($b0 & 0x80) == 0x00) {
		$len = 1; $result = $b0;
	} else if (($b0 & 0xe0) == 0xc0) {
		$len = 2; $result = $b0 & 0x1f;
	} else if (($b0 & 0xf0) == 0xe0) {
		$len = 3; $result = $b0 & 0x0f;
	} else if (($b0 & 0xf8) == 0xf0) {
		$len = 4; $result = $b0 & 0x07;
	} else if (($b0 & 0xfc) == 0xf8) {
		$len = 5; $result = $b0 & 0x03;
	} else if (($b0 & 0xfe) == 0xfc) {
		$len = 6; $result = $b0 & 0x01;
	} else {
		$len = 0; //stray continuation byte, or 0xfe/0xff
		$result = 0;
	}
	if ($len > 0 && $index + $len <= strlen($str)) {
		$ok = true;
		for ($k = 1; $k < $len; $k++) {
			$b = ord($str[$index + $k]);
			if (($b & 0xc0) != 0x80) {
				$ok = false; //not a 10xxxxxx continuation byte
				break;
			}
			//each continuation byte contributes 6 bits
			$result = ($result << 6) | ($b & 0x3f);
		}
		if ($ok) {
			$index += $len;
			return $result;
		}
	}
	//unknown case
	$result = ord($str[$index]);
	$index++;
	return $result;
}



Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Mon Jul 07 00:01:35 2025 UTC