Bug #48147 iconv with //IGNORE cuts the string
Submitted: 2009-05-04 14:52 UTC Modified: 2015-05-08 07:23 UTC
Votes:16
Avg. Score:4.3 ± 0.8
Reproduced:10 of 12 (83.3%)
Same Version:8 (80.0%)
Same OS:7 (70.0%)
From: kulakov74 at yandex dot ru
Assigned: stas
Status: Closed
Package: ICONV related
PHP Version: 5.*, 6CVS (2009-05-05)
OS: Linux
Private report: No
CVE-ID: None
 [2009-05-04 14:52 UTC] kulakov74 at yandex dot ru
Description:
------------
Without //IGNORE, iconv() is known to cut the string at the first illegal character; with //IGNORE it should not. Still, I get truncated text, and the cut is not at the point where the illegal character is. Note that my actual PHP version is 5.2.6 and I cannot upgrade it. Can you test this with the latest version? Please download the file from http://www.oppcharts.com/iconv.html

Reproduce code:
---------------
$Body1=... //read the file

echo(strlen($Body1)."\n");
$Body2=iconv('UTF-8', 'ISO-8859-1', $Body1);
echo(strlen($Body2)."\n");

$Body2=iconv('UTF-8', 'ISO-8859-1//IGNORE', $Body1);
echo(strlen($Body2)."\n");



Expected result:
----------------
15323
Notice: iconv(): Detected an illegal character in input string in /home/doldon/html/tdnam/dev.php on line 18
3588
-----------------------------------
15323
15321 - I can get this if I use //TRANSLIT or when I run the test on my home Windows PHP 4


Actual result:
--------------
15323
Notice: iconv(): Detected an illegal character in input string in /home/doldon/html/tdnam/dev.php on line 18
3588
-----------------------------------
15323
Notice: iconv(): Detected an illegal character in input string in /home/doldon/html/tdnam/dev.php on line 18
8157 - THIS IS THE PROBLEM

 [2009-05-06 05:13 UTC] kulakov74 at yandex dot ru
Here is the full script. I'm not sure whether there is a limit on external resources; the script downloads the file to convert.

<?php

error_reporting(E_ALL); 

$Body1=file_get_contents("http://www.oppcharts.com/iconv.html");

echo(strlen($Body1)."\n");
$Body2=iconv('UTF-8', 'ISO-8859-1', $Body1);
echo(strlen($Body2)."\n");

$Body2=iconv('UTF-8', 'ISO-8859-1//IGNORE', $Body1);
echo(strlen($Body2)."\n");

?>
 [2009-05-06 14:38 UTC] jani@php.net
It just means you're using glibc iconv implementation which does not 
have the IGNORE parameter implemented.
 [2009-05-06 18:18 UTC] kulakov74 at yandex dot ru
No. The fact that the script displays the notice "iconv(): Detected an illegal character ..." in both cases has nothing to do with whether the option is implemented: the notice is controlled by error_reporting(E_ALL). The IGNORE option only controls whether iconv stops at the character or not.

Also, the resulting string is longer with IGNORE. Without it, the string ends exactly where the illegal character is; with IGNORE, it ends at a seemingly random point where no such characters occur.

I also did not mention that this is not the only file I converted. Many others were converted correctly with the option, and their length decreased only slightly. But 2 files were truncated, and the smaller of them is used for the test case.

Can you run the test with the latest PHP releases? That is actually why I reported the bug. I tried it on other servers with PHP 4.3.3, 5.1.4, 5.1.6, 5.2.4 and 5.2.6, and finally found one with 5.2.9 (built Feb 27 2009); the results were the same everywhere.

I repeat: the TRANSLIT option works fine, even though it does the same thing and more.
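
Editorial sketch: the truncation described above can be detected without relying on iconv's own error reporting. The PHP snippet below assumes PHP 7.1+ and valid UTF-8 input. The function names `expectedLatin1Length` and `looksTruncated` are hypothetical and do not appear anywhere in this report. The idea is that every code point from U+0000 to U+00FF maps to exactly one ISO-8859-1 byte, so counting those code points gives the length a correct "drop what cannot be represented" conversion should produce.

```php
<?php
// Hypothetical helper (not part of PHP or of this report): number of bytes a
// UTF-8 -> ISO-8859-1 conversion should yield when unrepresentable characters
// are dropped. Returns null if the input is not valid UTF-8.
function expectedLatin1Length(string $utf8): ?int
{
    // Each code point in U+0000..U+00FF becomes exactly one ISO-8859-1 byte.
    $count = preg_match_all('/[\x{00}-\x{FF}]/u', $utf8);
    return $count === false ? null : $count;
}

// Hypothetical helper: true when the converted string is shorter than the
// number of representable characters in the source, i.e. data went missing.
function looksTruncated(string $utf8, string $converted): bool
{
    $expected = expectedLatin1Length($utf8);
    return $expected !== null && strlen($converted) < $expected;
}
```

Calling `looksTruncated($Body1, $Body2)` after the //IGNORE conversion in the script above would be expected to flag a result like the 8157-byte one, since a loss that large cannot be explained by a handful of dropped characters.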
 [2009-05-06 18:36 UTC] jani@php.net
Arnaud: Please don't reopen bogus bugs without explanation. 
 [2009-05-07 07:50 UTC] lbarnaud@php.net
Marked it as verified as I got exactly the same results:

The first iconv() call (the one without //IGNORE) fails on the ellipsis character "…" (value="Search…"), which can't be represented in ISO-8859-1.

The second iconv() call (the one with //IGNORE) fails later (so the ellipsis is ignored, which may mean that the //IGNORE flag is supported), and there is no apparent reason for it to fail at offset 8157 (only regular ASCII chars around).
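
Editorial sketch: to check which character the first call stops on, one can search the input for the first code point above U+00FF. The helper name `firstNonLatin1Offset` is hypothetical, and the snippet assumes PHP 7.1+ and valid UTF-8 input (for invalid input it also returns null, so it cannot distinguish the two cases).

```php
<?php
// Hypothetical helper: 0-based byte offset of the first character that has no
// ISO-8859-1 equivalent (any code point above U+00FF), or null if none is
// found. Also returns null when the input is not valid UTF-8.
function firstNonLatin1Offset(string $utf8): ?int
{
    // With PREG_OFFSET_CAPTURE, $m[0][1] is a byte offset, even under /u.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $utf8, $m, PREG_OFFSET_CAPTURE) === 1) {
        return $m[0][1];
    }
    return null;
}

// The attribute quoted in the comment above: the ellipsis follows 13 ASCII bytes.
var_dump(firstNonLatin1Offset('value="Search' . "\u{2026}" . '"'));
```

If everything before the reported offset is plain ASCII, this byte offset should match the length of the string returned by the first (non-IGNORE) call.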
 [2009-05-07 13:58 UTC] jani@php.net
We still can't fix bugs in the glibc iconv implementation. Try this on the
command line and you get the same results:

# iconv -f utf-8 -t iso-8859-1 iconv.html > /dev/null
iconv: illegal input sequence at position 3589

# iconv -f utf-8 -t iso-8859-1//IGNORE iconv.html > /dev/null
iconv: illegal input sequence at position 8168

 [2011-12-18 19:37 UTC] ezyang@php.net
Not broken in latest version of libiconv

ezyang@javelin:~/Desktop/libiconv-1.14/src$ ./iconv_no_i18n --version
iconv (GNU libiconv 1.14)
Copyright (C) 2000-2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Bruno Haible.
ezyang@javelin:~/Desktop/libiconv-1.14/src$ ./iconv_no_i18n -f utf-8 -t iso-8859-1//IGNORE ~/iconv.html | wc -c
15312
ezyang@javelin:~/Desktop/libiconv-1.14/src$ iconv -f utf-8 -t iso-8859-1//IGNORE ~/iconv.html | wc -c
iconv: illegal input sequence at position 8168
8157
 [2011-12-23 00:49 UTC] ezyang@php.net
-Status: Bogus +Status: Re-Opened
 [2011-12-23 00:49 UTC] ezyang@php.net
I think I understand how to fix this bug, without modifying glibc. We need to modify our invocation of iconv in order to mirror the behavior of iconv_prog.c:process_block() when the '-c' flag is set (if we mimic the code closely enough, we also get an extra bonus of sensible block processing behavior, which is better than the horrible over-allocation iconv does right now). In particular, we need to handle the EILSEQ error code correctly.
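
Editorial sketch: the fix proposed above lives in the C code of the extension and is not reproduced here. As a userland stopgap for the specific UTF-8 to ISO-8859-1 case in this report, one could instead drop the unrepresentable characters before calling iconv(), so that //IGNORE is never needed. This is a different technique from mirroring process_block(). The function name `utf8ToLatin1Dropping` is hypothetical, and the snippet assumes PHP 7.0+ and valid UTF-8 input.

```php
<?php
// Hypothetical workaround (not the fix committed to php-src): remove every
// code point above U+00FF first, then convert without //IGNORE.
// Returns false if the input is not valid UTF-8 or if iconv() fails.
function utf8ToLatin1Dropping(string $utf8)
{
    // preg_replace() returns null when the subject is not valid UTF-8 under /u.
    $filtered = preg_replace('/[^\x{00}-\x{FF}]/u', '', $utf8);
    if ($filtered === null) {
        return false;
    }
    // Every remaining character has an ISO-8859-1 equivalent, so a plain
    // conversion should not hit an illegal-character error.
    return iconv('UTF-8', 'ISO-8859-1', $filtered);
}
```

Because the pattern is specific to ISO-8859-1, this sketch does not generalize to other target character sets; those would need their own list of representable code points.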
 [2012-01-08 12:33 UTC] pajoye@php.net
-Status: Re-Opened +Status: Feedback
 [2012-01-08 12:33 UTC] pajoye@php.net
To me it looks like there is no bug (as stated in the redhat issues). Also even if 
there was one, it would not be a PHP bug but iconv's.

Or do you have any information that shows that PHP is causing this problem here?
 [2012-10-27 09:26 UTC] ezyang@php.net
I submitted an updated bug report to glibc that correctly describes the incorrect behavior in glibc: http://sourceware.org/bugzilla/show_bug.cgi?id=13541

The facts of the matter are as follows:

1) glibc has inconsistent behavior about what the EILSEQ error code is supposed to mean, between its documentation and its behavior
2) glibc and libiconv have different behavior
3) A user of PHP who would like to use iconv to convert between two character sets while ignoring malformed characters *cannot do so* with the most recent versions of PHP (5.4+). (Trust me, I've tried.) In old versions of PHP, this functionality was available. Thus, this bug is a regression.

If you want to blame upstream, that's fine by me, but I'm not optimistic on glibc getting updated any time in the near future, and there is a well understood (and implemented elsewhere) fix which gives us the correct behavior.
 [2013-02-18 00:33 UTC] php-bugs at lists dot php dot net
No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Open". Thank you.
 [2015-05-08 07:22 UTC] stas@php.net
-Assigned To: +Assigned To: stas
 [2015-05-08 07:23 UTC] stas@php.net
-Status: No Feedback +Status: Assigned
 [2015-05-10 02:29 UTC] stas@php.net
Automatic comment on behalf of stas
Revision: http://git.php.net/?p=php-src.git;a=commit;h=473ec539a1c3d242c8b171dd6a5a98fa17e05c13
Log: Fix #48147 - implement manual handling of  //IGNORE for broken libc
 [2015-05-10 02:29 UTC] stas@php.net
-Status: Assigned +Status: Closed
 [2015-05-10 02:29 UTC] stas@php.net
Automatic comment on behalf of stas
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f8f1d275cfb90744e7387db72ab1857c63c352d8
Log: Fix #48147 - implement manual handling of  //IGNORE for broken libc
 [2016-07-20 11:38 UTC] davey@php.net
Automatic comment on behalf of stas
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f8f1d275cfb90744e7387db72ab1857c63c352d8
Log: Fix #48147 - implement manual handling of  //IGNORE for broken libc
 