Request #65081 new function for replacing ill-formed byte sequences with substitute characters
Submitted: 2013-06-21 03:20 UTC Modified: 2016-10-17 06:33 UTC
Votes:1
Avg. Score:2.0 ± 0.0
Reproduced:0 of 0 (0.0%)
From: masakielastic at gmail dot com Assigned: yohgaki (profile)
Status: Closed Package: mbstring related
PHP Version: 5.5.0 OS: All
Private report: No CVE-ID: None

 [2013-06-21 03:20 UTC] masakielastic at gmail dot com
Description:
------------
A new function for replacing ill-formed byte sequences with substitute characters 
is needed. The problem with using mb_convert_encoding for that purpose is that the 
function name doesn't convey the intent. Specifying the same encoding twice is 
verbose and can look like a meaningless conversion to beginners. 

$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
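
To make the idiom above concrete, here is a minimal sketch of what the same-encoding call does to an ill-formed string. The sample input is an assumption chosen for illustration; the exact replacement character depends on the current mb_substitute_character() setting, so the sketch checks validity rather than a specific output.

```php
<?php
// "\xFF" can never appear in well-formed UTF-8, so this string is ill-formed.
$dirty = "abc\xFFdef";

var_dump(mb_check_encoding($dirty, 'UTF-8'));   // validity before scrubbing

// Source and target encoding are the same: nothing is transcoded, but
// ill-formed bytes are replaced according to mb_substitute_character().
$clean = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

var_dump(mb_check_encoding($clean, 'UTF-8'));   // validity after scrubbing
var_dump($clean);
```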

A precedent exists in Ruby: Ruby 2.1 introduces String#scrub.

http://bugs.ruby-lang.org/issues/6752
https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7770-L7783

Whether the substitute character should be specifiable needs to be discussed.

function mb_scrub($str, $encoding = '', $substitute = '')
{
    // Fall back to the internal encoding when none is given.
    if ('' === $encoding) {
        $encoding = mb_internal_encoding();
    }

    if ('' === $substitute) {
        // Use the currently configured substitute character.
        $ret = mb_convert_encoding($str, $encoding, $encoding);
    } else {
        // Temporarily switch the substitute character, then restore it.
        $before_substitute = mb_substitute_character();
        mb_substitute_character($substitute);
        $ret = mb_convert_encoding($str, $encoding, $encoding);
        mb_substitute_character($before_substitute);
    }

    return $ret;
}
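
As a usage sketch of the save-and-restore pattern above, the following helper substitutes U+FFFD and restores the previous setting in a finally block. The name scrub_with_substitute and its default arguments are hypothetical, not part of the proposal; a different name is used because PHP 7.2 and later ship a built-in mb_scrub(), which would make redeclaring it an error.

```php
<?php
// Hypothetical helper (not from the report): scrub with a temporary
// substitute code point and restore the previous setting afterwards.
function scrub_with_substitute($str, $encoding = 'UTF-8', $codepoint = 0xFFFD)
{
    $previous = mb_substitute_character();
    mb_substitute_character($codepoint);
    try {
        return mb_convert_encoding($str, $encoding, $encoding);
    } finally {
        // Runs even if the conversion throws, so the global setting
        // does not leak out of this function.
        mb_substitute_character($previous);
    }
}

$before = mb_substitute_character();
$clean  = scrub_with_substitute("abc\xFFdef");

var_dump(mb_check_encoding($clean, 'UTF-8'));
var_dump(mb_substitute_character() === $before);
```

The finally block is a design choice of this sketch: the substitute character is a global setting, so leaving it changed would affect unrelated mbstring calls.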

The same discussion applies to UConverter.

function uconverter_scrub($str, $encoding, $opts = null)
{
    // Pass the options array only when the caller supplied one.
    if (null === $opts) {
        return UConverter::transcode($str, $encoding, $encoding);
    }

    return UConverter::transcode($str, $encoding, $encoding, $opts);
}

Standard string functions and filter functions may also need to be discussed, 
since htmlspecialchars can be used for the same purpose.

function str_scrub($str, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
}
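
The round trip above can be exercised as follows. This is a sketch only: html_roundtrip_scrub is a hypothetical variant that passes matching quote flags to both calls, so that quotes are encoded and decoded symmetrically, and the sample input is an assumption.

```php
<?php
// Hypothetical variant of str_scrub() with explicit, matching quote flags.
function html_roundtrip_scrub($str, $encoding = 'UTF-8')
{
    // ENT_SUBSTITUTE replaces ill-formed sequences with U+FFFD instead of
    // making htmlspecialchars() return an empty string.
    $flags   = ENT_QUOTES | ENT_SUBSTITUTE;
    $escaped = htmlspecialchars($str, $flags, $encoding);

    return htmlspecialchars_decode($escaped, ENT_QUOTES);
}

$dirty = "<a href='x'>caf\xFF & \"tea\"</a>";
$clean = html_roundtrip_scrub($dirty);

var_dump(mb_check_encoding($clean, 'UTF-8'));
var_dump($clean);
```

Unlike the mbstring approach, the replacement character here is fixed by htmlspecialchars and cannot be chosen by the caller.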


 [2013-06-22 14:02 UTC] ab@php.net
Related to bug #65045.
 [2013-08-01 08:56 UTC] yohgaki@php.net
-Assigned To: +Assigned To: yohgaki
 [2013-08-01 08:56 UTC] yohgaki@php.net
Assigned to me, so that this report is not forgotten.
 [2016-10-17 06:33 UTC] yohgaki@php.net
-Status: Assigned +Status: Closed
 [2016-10-17 06:33 UTC] yohgaki@php.net
A PR was submitted by the reporter and merged.
 