Bug #66507 string replace does not work on certain points on the string
Submitted: 2014-01-17 08:59 UTC Modified: 2014-01-17 10:31 UTC
From: hugo at domibay dot es Assigned:
Status: Not a bug Package: mbstring related
PHP Version: 5.4.24 OS: Centos 6.4
Private report: No CVE-ID: None
 [2014-01-17 08:59 UTC] hugo at domibay dot es
Description:
------------
I want to insert Chinese text into a database. The SQL query requires that all apostrophes be escaped,
for example "'" -> "\\'".
Because I need to process Chinese and other languages that do not use Latin characters, I chose "mb_ereg_replace()" for the string replacement.
I found that "mb_ereg_replace()" could replace Latin words within the text, and also apostrophes inside Latin passages, but it failed when a Chinese character came directly before the apostrophe,
for example "位于't Goor Park" -> "位于\\'t Goor Park".
I tried the following command. It worked on many texts, but on this one it failed.
$srs = mb_ereg_replace("[\']", "\\'", $srs);

Curiously, I can detect the apostrophe with "mb_strpos()", but I cannot replace it.
$iapops = mb_strpos($srs, "'");

Test script:
---------------
This script shows that some passages can be replaced, but the apostrophe in question cannot, even though it is detected.

// $srs holds the UTF-8 encoded HTML text shown in the results below
$iapops = mb_strpos($srs, "'");

if($iapops !== false)
{
  echo "apostrophe found on '$iapops'\n";

  echo "test 0: '$srs'\n";

  $srs = mb_ereg_replace("(goor)", "[GREEN]", $srs, "i");

  echo "test 1: '$srs'\n";

  $srs = mb_ereg_replace("[\']", "[apostrophe]", $srs);

  echo "test 2: '$srs'\n";

  if(mb_strpos($srs, "'") !== false)
    echo "replace failed!";

}  //if($iapops !== false)

Expected result:
----------------
<p><b>酒店位置</b> <br />Hotel - Restaurant Het Ros van Twente位于de Lutte,位于[apostrophe]t [GREEN] Park和Huize Keizer Museum附近。 该 4 星级酒店位于Sandstone Museum of Bad Bentheim和本特海姆城堡地区。</p><p><b>客房</b> <br />
酒店有 30 间客房,提供平板电视。客房设有私人阳台。所提供的卫星电视可满足您的娱乐需求。便利设施包括直拨电话,以及保险>
箱和书桌。</p><p><b>休闲、SPA、高端服务设施</b> <br />享受桑拿等度假设施,或者到花园欣赏美景。</p><p><b>餐饮</b> <br />您可以到餐厅享用一顿美餐;也可以选择酒店的限时客房服务。欢迎光临酒吧/酒廊,点一杯喜欢的饮品,畅饮一番。</p><p><b>商
务及其他服务设施</b> <br />特色服务/设施包括会讲多种语言的服务员、公共区域空调和图书馆。这家酒店的活动设施包括会议室>
、小会议室和宴会设施。酒店提供免费停车设施。</p>

Actual result:
--------------
<p><b>酒店位置</b> <br />Hotel - Restaurant Het Ros van Twente位于de Lutte,位于't [GREEN] Park和Huize Keizer Museum附近。 该 4 星级酒店位于Sandstone Museum of Bad Bentheim和本特海姆城堡地区。</p><p><b>客房</b> <br />
酒店有 30 间客房,提供平板电视。客房设有私人阳台。所提供的卫星电视可满足您的娱乐需求。便利设施包括直拨电话,以及保险
箱和书桌。</p><p><b>休闲、SPA、高端服务设施</b> <br />享受桑拿等度假设施,或者到花园欣赏美景。</p><p><b>餐饮</b> <br />您可以到餐厅享用一顿美餐;也可以选择酒店的限时客房服务。欢迎光临酒吧/酒廊,点一杯喜欢的饮品,畅饮一番。</p><p><b>商
务及其他服务设施</b> <br />特色服务/设施包括会讲多种语言的服务员、公共区域空调和图书馆。这家酒店的活动设施包括会议室>
、小会议室和宴会设施。酒店提供免费停车设施。</p>

 [2014-01-17 09:37 UTC] requinix@php.net
-Status: Open +Status: Feedback
 [2014-01-17 09:37 UTC] requinix@php.net
Notice how mb_ereg_replace() never gave you the chance to tell it what encoding the string was in? That would be a job for mb_regex_encoding().

<?php // I ran this from a file encoded in UTF-8

$string = "位于't Goor Park";

// 位于't Goor Park
echo $string, "\n";

// 位于't Goor Park
echo mb_ereg_replace("[\']", "[apostrophe]", $string), "\n";

// 位于[apostrophe]t Goor Park
mb_regex_encoding("UTF-8"); // previous value was EUC-JP
echo mb_ereg_replace("[\']", "[apostrophe]", $string), "\n";

?>
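
One plausible explanation for why only this apostrophe is affected (an interpretation, not something confirmed in this report): in EUC-JP the byte 0x8E is a single-shift prefix that introduces a two-byte character, and the UTF-8 encoding of 于 ends in exactly that byte. An engine reading the string as EUC-JP could therefore consume 0x8E 0x27 as one character, so the apostrophe is never matched on its own. The sketch below only inspects the raw bytes; it does not exercise the regex engine, and it assumes the file is saved as UTF-8.

```php
<?php
// Sketch only: looks at the UTF-8 bytes, does not call mb_ereg_replace().
$string = "位于't Goor Park"; // this file must be saved as UTF-8

// Hex dump of the first seven bytes: two CJK characters plus the apostrophe.
$hex = bin2hex(substr($string, 0, 7));
echo $hex, "\n"; // e4bd8de4ba8e27

// The last byte of 于 is 0x8E, and the byte right after it is the apostrophe.
$lastCjkByte = ord($string[5]);
$apostrophe  = ord($string[6]);
printf("0x%02X 0x%02X\n", $lastCjkByte, $apostrophe); // 0x8E 0x27
```

An apostrophe that follows a character whose last UTF-8 byte is not such a prefix would not be affected in this way, which would fit the observation that the replacement worked on many other texts.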
 [2014-01-17 10:12 UTC] hugo at domibay dot es
I was able to reproduce this:

$string = "位于't Goor Park";

// 位于't Goor Park
echo $string, "\n";

// expected: 位于[apostrophe]t Goor Park
echo "encoding: '" . mb_regex_encoding() . "'\n";
echo mb_ereg_replace("[\']", "[apostrophe]", $string), "\n";

// expected 位于\'t Goor Park
mb_regex_encoding("UTF-8"); // previous value was EUC-JP
echo "encoding: '" . mb_regex_encoding() . "'\n";
echo mb_ereg_replace("[\']", "\\'", $string), "\n";

This produced the output:
位于't Goor Park
encoding: 'EUC-JP'
位于't Goor Park
encoding: 'UTF-8'
位于\'t Goor Park

So I went to my php.ini and changed the options in the "mbstring" section to:
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
mbstring.http_output = UTF-8

Running the script again then gave me this new output:
位于't Goor Park
encoding: 'UTF-8'
位于[apostrophe]t Goor Park
encoding: 'UTF-8'
位于\'t Goor Park

Thank you for your help.
I could not find any help on this, and I had always assumed that my internal encoding was "UTF-8", which it actually was not.
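
Where php.ini cannot be edited, the regex encoding can also be set per call at runtime, as the earlier comment suggests. The following is a minimal sketch under assumptions: `replace_apostrophes_utf8()` is a hypothetical helper name, not an mbstring function, and the type declarations require PHP 7 or later (newer than the 5.4.24 in this report).

```php
<?php
// Hypothetical helper (not part of mbstring): runs one replacement with the
// regex encoding forced to UTF-8, then restores whatever was set before.
function replace_apostrophes_utf8(string $text, string $replacement): string
{
    $previous = mb_regex_encoding();   // e.g. "EUC-JP" on the setup above
    mb_regex_encoding('UTF-8');
    $result = mb_ereg_replace("'", $replacement, $text);
    mb_regex_encoding($previous);      // leave the global state as it was

    if (!is_string($result)) {
        throw new RuntimeException('mb_ereg_replace() failed');
    }
    return $result;
}

$out = replace_apostrophes_utf8("位于't Goor Park", '[apostrophe]');
echo $out, "\n"; // 位于[apostrophe]t Goor Park
```

Restoring the previous value keeps the helper from silently changing the behaviour of other code that relies on the configured default.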
 [2014-01-17 10:17 UTC] hugo at domibay dot es
-Status: Feedback +Status: Closed
 [2014-01-17 10:17 UTC] hugo at domibay dot es
The solution for this unexpected output is to check the configuration in the "mbstring" section of php.ini.

Change the default configuration:
[mbstring]
;mbstring.language = Japanese
;mbstring.internal_encoding = EUC-JP
;mbstring.http_input = auto
;mbstring.http_output = SJIS

to this configuration, which works:
[mbstring]
mbstring.language = UTF-8
mbstring.internal_encoding = UTF-8
;mbstring.http_input = auto
mbstring.http_output = UTF-8
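
For a fixed single-character replacement like this one, the mbstring regex settings can be sidestepped altogether. The sketch below is an alternative to what was done in this report, not the reporter's method, and it assumes the input is valid UTF-8.

```php
<?php
$string = "位于't Goor Park"; // assumed to be valid UTF-8

// str_replace() compares bytes. In UTF-8 the byte 0x27 only ever encodes the
// apostrophe itself, so a byte-wise replacement cannot split a character.
$viaStrReplace = str_replace("'", "\\'", $string);

// preg_replace() with the /u modifier treats pattern and subject as UTF-8.
// The PHP literal "\\\\'" is the three characters \\' and PCRE reads the
// doubled backslash in a replacement as one literal backslash.
$viaPreg = preg_replace("/'/u", "\\\\'", $string);

echo $viaStrReplace, "\n"; // 位于\'t Goor Park
echo $viaPreg, "\n";       // 位于\'t Goor Park
```

Neither call depends on `mb_regex_encoding()`, so the result is the same whatever the "mbstring" section of php.ini contains.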
 [2014-01-17 10:18 UTC] requinix@php.net
-Status: Closed +Status: Not a bug
 [2014-01-17 10:18 UTC] requinix@php.net
Good to hear it's fixed.
 [2014-01-17 10:31 UTC] hugo at domibay dot es
I might have been a bit too quick with that configuration.

Although the previous one worked for me,
this one is probably more correct:
[mbstring]
mbstring.language = neutral
mbstring.internal_encoding = UTF-8
;mbstring.http_input = auto
mbstring.http_output = auto
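
To confirm which values a running process actually uses, the settings can be set and read back at runtime. This is a sketch of runtime counterparts to some of the php.ini lines above, not a drop-in replacement for them; `mbstring.http_output` is deliberately left out, and the `$settings` array is only an illustrative name.

```php
<?php
// Runtime counterparts of some of the php.ini lines above (a sketch only).
mb_language('neutral');          // mbstring.language = neutral
mb_internal_encoding('UTF-8');   // mbstring.internal_encoding = UTF-8
mb_regex_encoding('UTF-8');      // the encoding that mb_ereg_replace() uses

// Read the effective values back from the running process.
$settings = [
    'language'          => mb_language(),
    'internal_encoding' => mb_internal_encoding(),
    'regex_encoding'    => mb_regex_encoding(),
];
foreach ($settings as $name => $value) {
    echo $name, ': ', $value, "\n";
}
```

Printing the regex encoding separately is the useful part here, since in this report it was "EUC-JP" while the reporter assumed everything was "UTF-8".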
 