Bug #54688 case insensitive search of stripos does not work when searching äöü in utf-8
Submitted: 2011-05-08 17:00 UTC Modified: 2011-05-09 09:20 UTC
From: g dot huebgen at arcor dot de Assigned:
Status: Not a bug Package: Strings related
PHP Version: 5.3.6 OS: Linux
Private report: No CVE-ID: None
 [2011-05-08 17:00 UTC] g dot huebgen at arcor dot de
Description:
------------
---
From manual page: http://www.php.net/function.stripos#Description
---
If a text is encoded in UTF-8 and I search it with stripos for a string containing a lower-case umlaut (e.g. ü), the function does not find the upper-case umlaut (Ü). In other words, the case-insensitive search does not work for umlauts when the text file is encoded in UTF-8.

Test script:
---------------
File test.txt contains "Übermut" and is encoded UTF-8 without BOM
<?php
$text = file_get_contents("test.txt");
echo $text."<br>";
$str = "über";
if (($pos=stripos($text,$str)) !== false)
	echo $str." gefunden";
else echo $str." nicht gefunden";
?>

Expected result:
----------------
Übermut
über gefunden 

Actual result:
--------------
Übermut
über nicht gefunden 

 [2011-05-08 17:43 UTC] rasmus@php.net
-Status: Open +Status: Bogus
 [2011-05-08 17:43 UTC] rasmus@php.net
This is not a bug. The base string handling functions in PHP do not support 
multibyte character sets. Since UTF-8 is compatible with single-byte charsets at 
the low end, it may appear to work for UTF-8, but it will break as soon as you hit 
an actual mb character. You can use mb_stripos() in this case, or you can use the 
function overloading support in mbstring to make your stripos mb aware. 

See http://de.php.net/manual/en/mbstring.overload.php
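
To illustrate the point above, here is a minimal sketch of the byte-level difference. It assumes the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are only illustrative. In UTF-8, "Ü" and "ü" are two different two-byte sequences, so a byte-oriented comparison has no way to treat them as the same letter.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext/mbstring is available.
$text   = "Übermut";
$needle = "über";

// Each umlaut occupies two bytes in UTF-8.
$upperHex = bin2hex("Ü"); // c39c
$lowerHex = bin2hex("ü"); // c3bc
echo $upperHex, " ", $lowerHex, "\n";

// stripos() works on bytes, so it is not expected to match "Ü" against "ü".
$bytePos = stripos($text, $needle);
var_dump($bytePos);

// mb_stripos() decodes the bytes as UTF-8 characters before case folding.
$mbPos = mb_stripos($text, $needle, 0, "UTF-8");
var_dump($mbPos); // int(0)
```

The encoding is passed explicitly here so the result does not depend on the configured internal encoding.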
 [2011-05-08 20:17 UTC] g dot huebgen at arcor dot de
Hi rasmus.
I have now tried mb_stripos, but the result is no different from that of stripos.
The same program but using mb_stripos:
$text = file_get_contents("test-utf8.txt");
$str = "über";
if (($pos=mb_stripos($text,$str)) !== false)
	echo $str." found";
else echo $str." not found";

output is: not found!

If I use utf8_decode for both $text and $str then stripos will work properly.
 [2011-05-08 20:25 UTC] rasmus@php.net
That means your string is not actually in UTF-8. utf8_decode() converts text in 
ISO-8859-1 to UTF-8. You stated initially that you had text encoded in UTF-8.
 [2011-05-09 06:33 UTC] g dot huebgen at arcor dot de
The description of utf8_decode states clearly that this function decodes UTF8 text. The manual says:
"utf8_decode — Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1"

So my text is indeed in UTF-8 and my remark on utf8_decode only confirms what rasmus (comment #1) said.
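
The direction of the conversion quoted from the manual can be checked with a small sketch. It uses mb_convert_encoding() as a stand-in for utf8_decode() (which is deprecated as of PHP 8.2), and assumes the source file is saved as UTF-8 and mbstring is loaded.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext/mbstring is available.
$utf8 = "Übermut";

// Convert FROM UTF-8 TO ISO-8859-1, the same direction as utf8_decode().
$latin1 = mb_convert_encoding($utf8, "ISO-8859-1", "UTF-8");

// "Ü" takes two bytes in UTF-8 but only one byte in ISO-8859-1.
$utf8Length   = strlen($utf8);   // 8
$latin1Length = strlen($latin1); // 7
echo $utf8Length, " ", $latin1Length, "\n";

// The first byte of the converted string is the Latin-1 code for "Ü".
$firstByte = bin2hex($latin1[0]); // dc
echo $firstByte, "\n";
```

Once every character is a single byte, byte-oriented functions see one byte per letter; whether stripos() then folds 0xDC to 0xFC depends on the PHP version and locale.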
 [2011-05-09 06:44 UTC] rasmus@php.net
Well, somewhere along the way you have messed up your encoding since it works 
fine when both strings are UTF-8:

var_dump(mb_stripos("Übermut","über",0,"UTF-8"));

Are you saying that this doesn't give you int(0) on your platform?
 [2011-05-09 09:20 UTC] g dot huebgen at arcor dot de
You are right. Your mb_stripos works fine. 
My mistake in this was that I forgot the parameter "UTF-8"!
Now everything is clear.
Thank you
Gerhard
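
The effect of the missing parameter can be reproduced with a short sketch. It assumes the source file is saved as UTF-8 and mbstring is loaded, and it sets the internal encoding to ISO-8859-1 by hand to imitate a setup where that is the default; the actual default on the reporter's system is not stated in this thread.

```php
<?php
// Assumption: this file is saved as UTF-8 and ext/mbstring is available.
$text   = "Übermut";
$needle = "über";

// With the encoding given explicitly, the match is found at position 0.
$explicit = mb_stripos($text, $needle, 0, "UTF-8");
var_dump($explicit); // int(0)

// Imitate a non-UTF-8 internal encoding (an assumption for illustration).
$previous = mb_internal_encoding();
mb_internal_encoding("ISO-8859-1");

// Without the fourth argument, the UTF-8 bytes are read as ISO-8859-1
// characters, so the two umlauts no longer fold to the same letter.
$implicit = mb_stripos($text, $needle);
var_dump($implicit);

// Restore the original setting.
mb_internal_encoding($previous);
```

Passing the encoding on every call, or calling mb_internal_encoding("UTF-8") once at startup, avoids depending on the configured default.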
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sun Dec 22 05:01:30 2024 UTC