Doc Bug #81362 iconv UTF-8//IGNORE fails to strip "\xF5\x80\x80\x80"
Submitted: 2021-08-15 15:34 UTC Modified: 2021-08-16 12:44 UTC
From: divinity76 at gmail dot com Assigned:
Status: Open Package: ICONV related
PHP Version: 8.0.9 OS:
Private report: No CVE-ID: None

 

 [2021-08-15 15:34 UTC] divinity76 at gmail dot com
Description:
------------
This reproduces on "PHP8.0.3 + iconv 2.28 + Ubuntu 20.04".
It also reproduces on 3v4l.org on every version from PHP5.6.0 to PHP8.0.9 inclusive, though the iconv version and OS there are unknown: https://3v4l.org/1b4G8

(the string also appears to trigger a bug in mb_check_encoding() that was fixed in 5.6.0?)

Interestingly, it does *not* reproduce on "PHP7.3.7-for-cygwin + iconv 1.0 + Windows 10"; there it correctly strips the string down to string(3) "PHP".

Is it possibly the result of a bug introduced sometime after iconv 1.0 and no later than iconv 2.28?

Test script:
---------------
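<?php
// "\xF5" is never a valid UTF-8 lead byte, so the first four bytes are invalid.
$invalid_utf8 = "\xF5\x80\x80\x80PHP";
// //IGNORE is expected to drop the bytes that cannot be converted.
$should_be_valid_utf8 = iconv("UTF-8", "UTF-8//IGNORE", $invalid_utf8);

var_dump($should_be_valid_utf8, bin2hex($should_be_valid_utf8), mb_check_encoding($should_be_valid_utf8));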


Expected result:
----------------
string(3) "PHP"
string(6) "504850"
bool(true)


Actual result:
--------------
string(7) "����PHP"
string(14) "f5808080504850"
bool(false)


 [2021-08-15 15:42 UTC] cmb@php.net
-Type: Bug
+Type: Documentation Problem
 [2021-08-15 15:42 UTC] cmb@php.net
The latest GNU libiconv version is 1.16[1].  The iconv()
documentation[2] already warns about //TRANSLIT being
implementation specific; the same applies to //IGNORE.  I suggest
that you always build against GNU libiconv; system implementations
are known to be limited (or even buggy).

So no, not an implementation bug, but rather a doc issue.

[1] <https://www.gnu.org/software/libiconv/>
[2] <https://www.php.net/manual/en/function.iconv.php>
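
Because the behavior depends on which iconv library PHP was built against, it may help to record that alongside any reproduction. A minimal sketch, using the `ICONV_IMPL` and `ICONV_VERSION` constants that PHP's iconv extension defines; the helper name `describe_iconv()` is hypothetical and only for illustration:

```php
<?php
// Hypothetical helper: report which iconv implementation this PHP build uses.
function describe_iconv(): string
{
    if (!extension_loaded('iconv')) {
        return 'iconv extension not loaded';
    }
    // ICONV_IMPL names the implementation (for example "glibc" or "libiconv");
    // ICONV_VERSION is the version string of that implementation.
    return ICONV_IMPL . ' ' . ICONV_VERSION;
}

echo describe_iconv(), PHP_EOL;
```

Running this on each test machine would show whether two differing results actually come from different iconv implementations.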
 [2021-08-15 16:02 UTC] divinity76 at gmail dot com
@cmb that's interesting, any idea where the deb.sury.org builds get their "iconv version 2.28" from?

But it's not that they ignore the //IGNORE directive: they understand "//IGNORE" and yet don't strip this sequence. For example, they all correctly strip "\xf5\x80\x80PHP" down to "PHP", so the //IGNORE isn't being ignored.
 [2021-08-16 12:44 UTC] nikic@php.net
Presumably iconv (or at least this iconv implementation) doesn't consider encoded codepoints > U+10FFFF to be invalid.
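
The arithmetic behind that guess can be checked by hand: `\xF5` has the bit pattern of a 4-byte lead byte, so a decoder that only inspects bit patterns would read `\xF5\x80\x80\x80` as a single code point above U+10FFFF, while `\xF5\x80\x80` followed by `P` is a truncated sequence. A minimal sketch in plain PHP; the function name `decode_4byte_lenient()` is hypothetical, and it illustrates the bit layout only, not what any particular iconv implementation does internally:

```php
<?php
// Hypothetical helper: decode a 4-byte UTF-8-shaped sequence purely by bit
// pattern, without the range check (<= U+10FFFF) a strict decoder applies.
function decode_4byte_lenient(string $bytes): ?int
{
    if (strlen($bytes) < 4) {
        return null; // too short to hold a 4-byte sequence
    }
    // unpack() keys start at 1, so reindex from 0.
    $b = array_values(unpack('C4', $bytes));
    // Lead byte must look like 11110xxx.
    if (($b[0] & 0xF8) !== 0xF0) {
        return null;
    }
    // Each continuation byte must look like 10xxxxxx.
    for ($i = 1; $i < 4; $i++) {
        if (($b[$i] & 0xC0) !== 0x80) {
            return null;
        }
    }
    return (($b[0] & 0x07) << 18)
         | (($b[1] & 0x3F) << 12)
         | (($b[2] & 0x3F) << 6)
         |  ($b[3] & 0x3F);
}

$cp = decode_4byte_lenient("\xF5\x80\x80\x80");
printf("U+%X, above U+10FFFF: %s\n", $cp, $cp > 0x10FFFF ? 'yes' : 'no');

// "\xF5\x80\x80PHP": the fourth byte 'P' is not a continuation byte.
var_dump(decode_4byte_lenient("\xF5\x80\x80PHP"));
```

Under these assumptions the first input decodes to 0x140000, which is well formed by bit pattern but out of range, and the second is rejected as incomplete; that would be consistent with the two inputs being treated differently.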
 
PHP Copyright © 2001-2021 The PHP Group
All rights reserved.
Last updated: Wed Oct 20 16:03:42 2021 UTC