Request #65081 New function for replacing ill-formed byte sequences with substitute characters
Submitted: 2013-06-21 03:20 UTC Modified: 2016-10-17 06:33 UTC
Votes:1
Avg. Score:2.0 ± 0.0
Reproduced:0 of 0 (0.0%)
From: masakielastic at gmail dot com Assigned: yohgaki (profile)
Status: Closed Package: mbstring related
PHP Version: 5.5.0 OS: All
Private report: No CVE-ID: None
 [2013-06-21 03:20 UTC] masakielastic at gmail dot com
Description:
------------
A new function for replacing ill-formed byte sequences with substitute
characters is needed. The problem with using mb_convert_encoding for that
purpose is that the function name doesn't convey the intent. Specifying the
same encoding twice is verbose, and beginners may read it as a meaningless
conversion.

$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
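
To make the effect of that call concrete, here is a minimal sketch of a same-encoding conversion applied to a string containing one invalid byte. The sample input and the choice of U+FFFD as the substitute character are assumptions made for illustration; they are not part of the original request.

```php
<?php
// Sketch: same-encoding conversion used as a "scrub" step.
$dirty = "a\xFFb"; // 0xFF can never appear in well-formed UTF-8

$previous = mb_substitute_character();  // remember the current setting
mb_substitute_character(0xFFFD);        // assumption: use U+FFFD as the substitute

$clean = mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');

mb_substitute_character($previous);     // restore the earlier setting

var_dump(mb_check_encoding($dirty, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```

Nothing in the call itself says "repair this string", which is the readability problem described above.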

Ruby offers a precedent: Ruby 2.1 introduces String#scrub.

http://bugs.ruby-lang.org/issues/6752
https://github.com/ruby/ruby/blob/1e8a05c1dfee94db9b6b825097e1d192ad32930a/string.c#L7770-L7783

Whether the substitute character should be specifiable needs to be discussed.

function mb_scrub($str, $encoding = '', $substitute = '')
{
    // Fall back to the internal encoding when none is given.
    if ('' === $encoding) {
        $encoding = mb_internal_encoding();
    }

    // No custom substitute: use the currently configured one.
    if ('' === $substitute) {
        return mb_convert_encoding($str, $encoding, $encoding);
    }

    // Temporarily switch the substitute character, then restore it.
    $before_substitute = mb_substitute_character();
    mb_substitute_character($substitute);
    $ret = mb_convert_encoding($str, $encoding, $encoding);
    mb_substitute_character($before_substitute);

    return $ret;
}

This discussion also applies to UConverter.

function uconverter_scrub($str, $encoding, $opts = null)
{
    // Pass the options array only when the caller supplied one.
    if (null === $opts) {
        return UConverter::transcode($str, $encoding, $encoding);
    }

    return UConverter::transcode($str, $encoding, $encoding, $opts);
}

The same discussion may be needed for the standard string functions and
filter functions, since htmlspecialchars can be used for that purpose.

function str_scrub($str, $encoding = 'UTF-8')
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, $encoding));
}
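
As a rough illustration of the htmlspecialchars round trip, the sketch below runs it on a string that contains one invalid byte, an existing entity, and a tag. The helper name demo_str_scrub and the sample input are hypothetical, chosen only for this example. The validity check uses preg_match with the /u modifier so that the snippet does not depend on mbstring.

```php
<?php
// Hypothetical helper name; mirrors the str_scrub idea from the report.
function demo_str_scrub($str, $encoding = 'UTF-8')
{
    // ENT_SUBSTITUTE makes htmlspecialchars replace invalid sequences
    // with U+FFFD instead of returning an empty string.
    $escaped = htmlspecialchars($str, ENT_SUBSTITUTE, $encoding);

    // Undo the escaping of &, < and > so only the substitution remains.
    return htmlspecialchars_decode($escaped);
}

$input  = "a\xFFb &amp; <i>";   // one invalid byte plus markup
$output = demo_str_scrub($input);

// preg_match with the /u modifier fails on ill-formed UTF-8 subjects.
var_dump(preg_match('//u', $input) === 1);  // bool(false)
var_dump(preg_match('//u', $output) === 1); // bool(true)
```

The escape and decode steps cancel out for well-formed text, so the existing entity and the tag pass through unchanged. The intent is still hidden behind two unrelated function names, which is the concern raised above.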


History

 [2013-06-22 14:02 UTC] ab@php.net
Related to bug #65045.
 [2013-08-01 08:56 UTC] yohgaki@php.net
-Assigned To: +Assigned To: yohgaki
 [2013-08-01 08:56 UTC] yohgaki@php.net
Assigned to me so that this report is not forgotten.
 [2016-10-17 06:33 UTC] yohgaki@php.net
-Status: Assigned +Status: Closed
 [2016-10-17 06:33 UTC] yohgaki@php.net
The reporter submitted a PR, and it has been merged.
 