Bug #72330 CSV fields incorrectly split if escape char followed by UTF chars
Submitted: 2016-06-03 16:00 UTC Modified: 2018-04-10 17:01 UTC
From: cronfy at gmail dot com Assigned: cmb (profile)
Status: Closed Package: Strings related
PHP Version: Irrelevant OS: Linux Mint 17.1 Rebecca
Private report: No CVE-ID: None
 [2016-06-03 16:00 UTC] cronfy at gmail dot com
Description:
------------
When the escape character set for str_getcsv() is followed by certain UTF-8 characters, the string is parsed incorrectly.

I tested it on php 5.4, 5.5, 5.6 and 7.0 - behavior is the same.

Test script:
---------------
$utf_1 = chr(0xD1) . chr(0x81); // U+0441
$utf_2 = chr(0xD8) . chr(0x80); // U+0600

$string = '"first #' . $utf_1 . $utf_2 . '";"second one"';
$d = str_getcsv($string, ';', '"', "#");

print_r($d);



Expected result:
----------------
Array
(
    [0] => first #с؀
    [1] => second one
)


Actual result:
--------------
Array
(
    [0] => first #с؀";second one"
)


 [2016-06-14 13:21 UTC] cmb@php.net
-Status: Open +Status: Feedback
 [2016-06-14 13:21 UTC] cmb@php.net
I can't reproduce this behavior, see <https://3v4l.org/QZJSC>.
Maybe it's locale dependent?
 [2016-06-15 00:08 UTC] cmb@php.net
-Assigned To: +Assigned To: cmb
 [2016-06-26 04:22 UTC] php-bugs at lists dot php dot net
No feedback was provided. The bug is being suspended because
we assume that you are no longer experiencing the problem.
If this is not the case and you are able to provide the
information that was requested earlier, please do so and
change the status of the bug back to "Re-Opened". Thank you.
 [2016-07-17 11:07 UTC] cronfy at gmail dot com
-Status: No Feedback +Status: Closed
 [2016-07-17 11:07 UTC] cronfy at gmail dot com
Yes, it is locale dependent. If I prepend 'setlocale(LC_ALL, "C")' to the script, everything works correctly. But if I don't (my default locale is ru_RU.UTF-8), or if I explicitly use 'setlocale(LC_ALL, "ru_RU")', the result is incorrect.

I can't demonstrate it at 3v4l.org as it does not support required locales.
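
The workaround described in this comment can be sketched roughly as follows. This is only an illustration, not the fix that was later committed: the helper name str_getcsv_c_locale() is hypothetical, and it switches only LC_CTYPE (rather than LC_ALL as in the comment) on the assumption that this is the category the CSV functions consult. The previous locale is restored afterwards.

```php
<?php
// Hypothetical helper (not part of PHP): parse one CSV line under the "C" locale.
function str_getcsv_c_locale(
    string $line,
    string $delimiter = ',',
    string $enclosure = '"',
    string $escape = '\\'
): array {
    // Passing "0" only queries the current LC_CTYPE setting; it changes nothing.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');
    try {
        return str_getcsv($line, $delimiter, $enclosure, $escape);
    } finally {
        // Restore whatever locale was active before the call.
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}

$utf_1 = "\xD1\x81"; // U+0441
$utf_2 = "\xD8\x80"; // U+0600

$fields = str_getcsv_c_locale('"first #' . $utf_1 . $utf_2 . '";"second one"', ';', '"', '#');
print_r($fields);
```

Note that setlocale() affects the whole process, so this sketch is not suitable for multi-threaded server setups.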
 [2016-07-17 11:08 UTC] cronfy at gmail dot com
I can't change this bug status to Reopened, because I do not have enough privileges.
 [2016-07-17 11:45 UTC] cmb@php.net
-Status: Closed +Status: Re-Opened
 [2016-07-17 11:45 UTC] cmb@php.net
Thanks, cronfy.

Might be related to bug #55507.
 [2016-07-21 16:13 UTC] cmb@php.net
-Summary: str_getcsv() splits fields incorrectly if escape char flollowed by UTF chars +Summary: CSV fields incorrectly split if escape char followed by UTF chars
 [2016-07-21 16:13 UTC] cmb@php.net
This issue does not only affect str_getcsv(), but also other CSV
reading functions, so I've changed the bug title.
 [2016-07-21 17:13 UTC] cmb@php.net
Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f2c2a4be9e466f14677089efe33e20ca0b146809
Log: Fix #72330: CSV fields incorrectly split if escape char followed by UTF chars
 [2016-07-21 17:13 UTC] cmb@php.net
-Status: Re-Opened +Status: Closed
 [2016-10-17 10:10 UTC] bwoebi@php.net
Automatic comment on behalf of cmb
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f2c2a4be9e466f14677089efe33e20ca0b146809
Log: Fix #72330: CSV fields incorrectly split if escape char followed by UTF chars
 [2018-04-09 18:21 UTC] ganlvtech at qq dot com
str_getcsv() does not work correctly with quoted multibyte characters

PHP version: 7.2.2
Operating system: Windows 10 zh-CN

Description:
------------
str_getcsv() does not work correctly with quoted multibyte characters.

When the multibyte characters are simply separated by commas, everything seems OK.

If a value contains a quotation mark ("), I need to escape it by doubling it ("") and enclose the value in a pair of quotation marks. When I then decode the CSV string with str_getcsv(), this value is merged with the next value (I lose a column and get two values together in one column).

There is more than one kind of wrong result, but I think all of them are caused by the escaped quotation mark.

Bug #72330: CSV fields incorrectly split if escape char followed by UTF chars


Test script:
---------------
<?php
// Test 1
$data = [
    "\xE4\xBD\xA0\xE5\xA5\xBD", // 你好
    "\xE4\xB8\x96\xE7\x95\x8C", // 世界
];
$encoded = implode(',', array_map(function ($value) {
    return '"' . str_replace('"', '""', $value) . '"';
}, $data));
var_dump(str_getcsv($encoded) === $data);

// Test 2
$data = [
    "\"\xE5\x95\x8A", // "啊
];
$encoded = str_putcsv($data);
var_dump(str_getcsv($encoded) === $data);

/** @link https://bugs.php.net/bug.php?id=64183 */
function str_putcsv($fields, $delimiter = ',', $enclosure = '"', $escape_char = '\\') {
    $stream = fopen('php://memory', 'w+');
    fputcsv($stream, $fields, $delimiter, $enclosure, $escape_char);
    rewind($stream);
    return stream_get_contents($stream);
}


Expected result:
----------------
bool(true)
bool(true)


Actual result:
--------------
bool(false)
bool(false)
 [2018-04-09 21:30 UTC] cmb@php.net
-Status: Closed +Status: Re-Opened
 [2018-04-10 10:56 UTC] cmb@php.net
-Status: Re-Opened +Status: Feedback
 [2018-04-10 10:56 UTC] cmb@php.net
The given test script works for me as expected (PHP 7.2.2 on a
German Windows 10).  I guess there are locale related issues in
your case, since fgetcsv() takes into account LC_CTYPE[1].  Try to
set an appropriate UTF-8 locale[2] before calling str_getcsv().

[1] <http://www.php.net/manual/en/function.fgetcsv.php#refsect1-function.fgetcsv-notes>
[2] <http://www.php.net/manual/en/function.setlocale.php>
 [2018-04-10 11:44 UTC] ganlvtech at qq dot com
Thank you very much.

I've also tested the script on Ubuntu 16.04 with php 7.2.2. Test passed.

So the problem may only be reproducible on Windows with the code page set to one that has multi-byte characters (e.g. Chinese Simplified cp936).

After searching the web for half an hour, I found that setting the locale to UTF-8 on Windows is impossible.

There might be a hack for the problem: set the locale to 'en_US' or any locale without multi-byte characters. This works fine, but I don't think it's a good way.

In a word, this function is not binary-safe.
 [2018-04-10 16:48 UTC] cmb@php.net
-Status: Feedback +Status: Re-Opened
 [2018-04-10 16:48 UTC] cmb@php.net
Thanks for further investigating!

> Set locale to 'en_US' or any locale without multi-byte
> characters. This works fine, but I don't think it's a good way.

It appears to be a viable workaround, as long as you are dealing
with valid UTF-8, and the delimiter and enclosing characters are
single-byte.

A cleaner, but less efficient solution would be to convert the
string to a supported encoding, so str_getcsv() works as expected,
and to convert the resulting array elements back to UTF-8.

Either way, the general caveat regarding setlocale() in
multi-threaded environments[1] applies.

So the best solution might be to use a userland implementation
which does not rely on the locale at all, but rather supports
specifying the character encoding of the CSV input.  Adding
something like this to PHP would require the RFC process[2]. 
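
A locale-independent userland parser of the kind suggested above could look roughly like the following. This is a minimal sketch under stated assumptions, not a replacement for str_getcsv(): the function name parse_csv_line_bytes() is hypothetical, the input is assumed to be UTF-8 (or ASCII), the delimiter and enclosure are assumed to be single-byte ASCII characters, only doubled enclosures are treated as escapes (there is no separate escape character), and a trailing line break is not stripped.

```php
<?php
/**
 * Hypothetical helper: split one CSV record byte by byte, ignoring the locale.
 * This is safe for UTF-8 because the bytes of a multi-byte UTF-8 sequence
 * are all >= 0x80 and can never equal an ASCII delimiter or enclosure.
 */
function parse_csv_line_bytes(string $line, string $delimiter = ',', string $enclosure = '"'): array
{
    $fields = [];
    $field = '';
    $inQuotes = false;
    $length = strlen($line);

    for ($i = 0; $i < $length; $i++) {
        $byte = $line[$i];
        if ($inQuotes) {
            if ($byte === $enclosure) {
                if ($i + 1 < $length && $line[$i + 1] === $enclosure) {
                    $field .= $enclosure; // doubled enclosure: literal quote
                    $i++;                 // skip the second quote
                } else {
                    $inQuotes = false;    // closing enclosure
                }
            } else {
                $field .= $byte;          // any other byte is field content
            }
        } elseif ($byte === $enclosure) {
            $inQuotes = true;             // opening enclosure
        } elseif ($byte === $delimiter) {
            $fields[] = $field;           // end of field
            $field = '';
        } else {
            $field .= $byte;
        }
    }
    $fields[] = $field;                   // last field of the record

    return $fields;
}

// The two inputs from the test script in this report.
var_dump(parse_csv_line_bytes("\"\xE4\xBD\xA0\xE5\xA5\xBD\",\"\xE4\xB8\x96\xE7\x95\x8C\""));
var_dump(parse_csv_line_bytes("\"\"\"\xE5\x95\x8A\""));
```

Supporting other encodings would require decoding characters instead of comparing bytes, which is where an explicit encoding parameter would come in.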

> In a word, this function is not binary-safe.

In PHP context "binary-safe" usually means that a function
correctly processes strings containing NUL bytes.  str_getcsv()
does this[3], so the function is "binary-safe".

Obviously, you are referring to another definition of
"binary-safe"[4]. In this sense, str_getcsv() is not
"binary-safe", and actually it can't be, because there are
character encodings where the binary representation of the
delimiter and escape character may be *part* of the binary
representation of other characters.
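
As an illustration of that last point (an example chosen here, not taken from the report): in Shift_JIS the character U+8868 is encoded as the bytes 0x95 0x5C, and 0x5C is also the ASCII backslash, the default escape character. The same character in UTF-8 contains no ASCII bytes at all. The bytes are written out literally so the sketch does not depend on any extension.

```php
<?php
// U+8868 in two encodings, written as raw bytes.
$sjis = "\x95\x5C";     // Shift_JIS: the second byte is 0x5C, i.e. ASCII "\"
$utf8 = "\xE8\xA1\xA8"; // UTF-8: every byte is >= 0x80

// A byte-wise search finds a "backslash" inside the Shift_JIS character ...
var_dump(strpos($sjis, '\\')); // int(1)

// ... but never inside the UTF-8 encoded one.
var_dump(strpos($utf8, '\\')); // bool(false)
```

A parser that looks at single bytes would therefore see an escape character in the middle of the Shift_JIS character, which is why the CSV functions have to know the character encoding.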

I'm going to improve the relevant documentation, and will close
this ticket afterwards, since there's not much else which could be
done, unfortunately.

[1] <http://www.php.net/manual/en/function.setlocale.php#refsect1-function.setlocale-notes>
[2] <https://wiki.php.net/rfc/howto>
[3] <https://3v4l.org/qMBEG>
[4] <https://en.wikipedia.org/wiki/Binary-safe>
 [2018-04-10 17:00 UTC] cmb@php.net
Automatic comment from SVN on behalf of cmb
Revision: http://svn.php.net/viewvc/?view=revision&revision=344648
Log: Clarify locale awareness of the CSV reading functions

See bug #72330.
 [2018-04-10 17:01 UTC] cmb@php.net
-Status: Re-Opened +Status: Closed
 [2018-04-10 17:20 UTC] ganlvtech at qq dot com
Thanks for your systematic explanation. This might not be a bug, and it can be closed now. Thank you.
 