Bug #66507 string replace does not work on certain points on the string
Submitted: 2014-01-17 08:59 UTC
Modified: 2014-01-17 10:31 UTC
From: hugo at domibay dot es
Assigned:
Status: Not a bug
Package: mbstring related
PHP Version: 5.4.24
OS: CentOS 6.4
Private report: No
CVE-ID: None

 [2014-01-17 08:59 UTC] hugo at domibay dot es
Description:
------------
I want to insert certain Chinese text into a database. The SQL query requires that all apostrophes are escaped,
like "'" -> "\\'".
As I want to process Chinese and other languages that don't use Latin characters, I opted for a string replace with "mb_ereg_replace()".
I found that "mb_ereg_replace()" could replace Latin-character words within the text, and also apostrophes between Latin-character passages, but it was unable to replace an apostrophe that came directly after a Chinese character,
as in "位于't Goor Park" -> "位于\\'t Goor Park".
I tried the following command to achieve this. It worked on many texts, but it failed on this one.
$srs = mb_ereg_replace("[\']", "\\'", $srs);

Curiously, I can detect the apostrophe with "mb_strpos()", but I can't replace it.
$iapops = mb_strpos($srs, "'");
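
For the specific case of an ASCII apostrophe, a byte-oriented replace may be enough, because UTF-8 never reuses ASCII byte values inside a multi-byte sequence. The following is a minimal sketch, assuming the text is valid UTF-8 and the file itself is saved as UTF-8; the variable name $escaped is only illustrative.

```php
<?php
// Assumption: $srs contains valid UTF-8 text.
$srs = "位于't Goor Park";

// str_replace() works on bytes and needs no encoding setting;
// the byte 0x27 can only ever be an apostrophe in UTF-8.
$escaped = str_replace("'", "\\'", $srs);

echo $escaped, "\n";
```

This sidesteps the regex encoding entirely, at the cost of only being safe for encodings that, like UTF-8, keep ASCII bytes out of multi-byte characters.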

Test script:
---------------
This script shows that some passages can be replaced, but the important apostrophe cannot be replaced, even though it can be detected.

// $srs holds the input text (the hotel description shown in the results below)
$iapops = mb_strpos($srs, "'");

if($iapops !== false)
{
  echo "apostrophe found on '$iapops'\n";

  echo "test 0: '$srs'\n";

  $srs = mb_ereg_replace("(goor)", "[GREEN]", $srs, "i");

  echo "test 1: '$srs'\n";

  $srs = mb_ereg_replace("[\']", "[apostrophe]", $srs);

  echo "test 2: '$srs'\n";

  if(mb_strpos($srs, "'") !== false)
    echo "replace failed!";

}  //if($iapops !== false)

Expected result:
----------------
<p><b>酒店位置</b> <br />Hotel - Restaurant Het Ros van Twente位于de Lutte,位于[apostrophe]t [GREEN] Park和Huize Keizer Museum附近。 该 4 星级酒店位于Sandstone Museum of Bad Bentheim和本特海姆城堡地区。</p><p><b>客房</b> <br />
酒店有 30 间客房,提供平板电视。客房设有私人阳台。所提供的卫星电视可满足您的娱乐需求。便利设施包括直拨电话,以及保险
箱和书桌。</p><p><b>休闲、SPA、高端服务设施</b> <br />享受桑拿等度假设施,或者到花园欣赏美景。</p><p><b>餐饮</b> <br />您可以到餐厅享用一顿美餐;也可以选择酒店的限时客房服务。欢迎光临酒吧/酒廊,点一杯喜欢的饮品,畅饮一番。</p><p><b>商
务及其他服务设施</b> <br />特色服务/设施包括会讲多种语言的服务员、公共区域空调和图书馆。这家酒店的活动设施包括会议室
、小会议室和宴会设施。酒店提供免费停车设施。</p>

Actual result:
--------------
<p><b>酒店位置</b> <br />Hotel - Restaurant Het Ros van Twente位于de Lutte,位于't [GREEN] Park和Huize Keizer Museum附近。 该 4 星级酒店位于Sandstone Museum of Bad Bentheim和本特海姆城堡地区。</p><p><b>客房</b> <br />
酒店有 30 间客房,提供平板电视。客房设有私人阳台。所提供的卫星电视可满足您的娱乐需求。便利设施包括直拨电话,以及保险
箱和书桌。</p><p><b>休闲、SPA、高端服务设施</b> <br />享受桑拿等度假设施,或者到花园欣赏美景。</p><p><b>餐饮</b> <br />您可以到餐厅享用一顿美餐;也可以选择酒店的限时客房服务。欢迎光临酒吧/酒廊,点一杯喜欢的饮品,畅饮一番。</p><p><b>商
务及其他服务设施</b> <br />特色服务/设施包括会讲多种语言的服务员、公共区域空调和图书馆。这家酒店的活动设施包括会议室
、小会议室和宴会设施。酒店提供免费停车设施。</p>

 [2014-01-17 09:37 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2014-01-17 09:37 UTC] requinix@php.net
Notice how mb_ereg_replace() never gave you the chance to tell it what encoding the string was in? That would be a job for mb_regex_encoding().

<?php // I ran this from a file encoded in UTF-8

$string = "位于't Goor Park";

// 位于't Goor Park
echo $string, "\n";

// 位于't Goor Park
echo mb_ereg_replace("[\']", "[apostrophe]", $string), "\n";

// 位于[apostrophe]t Goor Park
mb_regex_encoding("UTF-8"); // previous value was EUC-JP
echo mb_ereg_replace("[\']", "[apostrophe]", $string), "\n";

?>
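
One plausible reading of why the EUC-JP default skips this apostrophe (an assumption, not something confirmed in this report): the UTF-8 encoding of 于 ends in the byte 0x8E, which EUC-JP uses as the lead byte of a two-byte character, so the apostrophe byte 0x27 that follows could be consumed as part of that character. The sketch below only prints the raw bytes so the sequence can be inspected; it assumes the file is saved as UTF-8, and $bytes is an illustrative name.

```php
<?php
// Assumption: this file is saved as UTF-8.
$string = "位于't Goor Park";

// Bytes 3..6 are the three UTF-8 bytes of 于 followed by the apostrophe.
$bytes = bin2hex(substr($string, 3, 4));

echo $bytes, "\n"; // e4ba8e27
```

The final pair, 8e 27, is the part that a regex engine working in EUC-JP might read as one character.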
 [2014-01-17 10:12 UTC] hugo at domibay dot es
I was able to reproduce this:

$string = "位于't Goor Park";

// 位于't Goor Park
echo $string, "\n";

// expected: 位于[apostrophe]t Goor Park
echo "encoding: '" . mb_regex_encoding() . "'\n";
echo mb_ereg_replace("[\']", "[apostrophe]", $string), "\n";

// expected 位于\'t Goor Park
mb_regex_encoding("UTF-8"); // previous value was EUC-JP
echo "encoding: '" . mb_regex_encoding() . "'\n";
echo mb_ereg_replace("[\']", "\\'", $string), "\n";

This produced the output:
位于't Goor Park
encoding: 'EUC-JP'
位于't Goor Park
encoding: 'UTF-8'
位于\'t Goor Park

So I went to my php.ini and changed the options in the "mbstring" section to:
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
mbstring.http_output = UTF-8

Running the script again then gave me this new output:
位于't Goor Park
encoding: 'UTF-8'
位于[apostrophe]t Goor Park
encoding: 'UTF-8'
位于\'t Goor Park

Thank you for your help.
I couldn't find help on this, and I had always assumed that my internal encoding was "UTF-8", which it actually wasn't.
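
A quick way to see which encodings are actually in effect, rather than assuming them, might look like the sketch below. It assumes the mbstring extension is loaded; the variable names are illustrative.

```php
<?php
// Report the encodings mbstring is currently using.
echo "internal encoding: ", mb_internal_encoding(), "\n";
echo "regex encoding:    ", mb_regex_encoding(), "\n";

// Check that the input really is valid UTF-8 before processing it.
$string = "位于't Goor Park";
$isUtf8 = mb_check_encoding($string, "UTF-8");
echo "valid UTF-8: ", $isUtf8 ? "yes" : "no", "\n";
```

Called with no arguments, mb_internal_encoding() and mb_regex_encoding() return the current setting without changing it.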
 [2014-01-17 10:17 UTC] hugo at domibay dot es
-Status: Feedback +Status: Closed
 [2014-01-17 10:17 UTC] hugo at domibay dot es
The solution for this unexpected output is to check the configuration in the "mbstring" section of php.ini.

Change the default configuration:
[mbstring]
;mbstring.language = Japanese
;mbstring.internal_encoding = EUC-JP
;mbstring.http_input = auto
;mbstring.http_output = SJIS

to this configuration, which works:
[mbstring]
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
;mbstring.http_input = auto
mbstring.http_output = UTF-8
 [2014-01-17 10:18 UTC] requinix@php.net
-Status: Closed +Status: Not a bug
 [2014-01-17 10:18 UTC] requinix@php.net
Good to hear it's fixed.
 [2014-01-17 10:31 UTC] hugo at domibay dot es
I might have been a bit too quick about the configuration.

Although the other one worked for me, this one might actually be more correct:
[mbstring]
mbstring.language = neutral
mbstring.internal_encoding = UTF-8
;mbstring.http_input = auto
mbstring.http_output = auto
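
Where php.ini cannot be changed, the regex encoding can also be set at runtime around the call. The helper below is a hypothetical sketch (replace_apostrophes_utf8 is not part of mbstring) that assumes UTF-8 input and restores the previous setting afterwards.

```php
<?php
// Hypothetical helper: replaces apostrophes with the regex encoding
// forced to UTF-8, then restores whatever encoding was set before.
function replace_apostrophes_utf8($text, $replacement)
{
    $previous = mb_regex_encoding();   // remember the current setting
    mb_regex_encoding("UTF-8");        // assumption: $text is UTF-8
    $result = mb_ereg_replace("'", $replacement, $text);
    mb_regex_encoding($previous);      // put the old setting back
    return $result;
}

echo replace_apostrophes_utf8("位于't Goor Park", "[apostrophe]"), "\n";
```

Restoring the previous value keeps the helper from changing behaviour for other code that relies on the configured default.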
 