php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #74713 CSV cell split after fputcsv() + fgetcsv() round trip.
Submitted: 2017-06-08 17:14 UTC Modified: 2018-09-12 22:53 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: andreas at dqxtech dot net Assigned: cmb (profile)
Status: Duplicate Package: Filesystem function related
PHP Version: 7.1.5 OS: Linux / 3v4l
Private report: No CVE-ID: None
 [2017-06-08 17:14 UTC] andreas at dqxtech dot net
Description:
------------
If one cell of the data sent to fputcsv() contains "{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}", this cell will be split after a round trip of fputcsv() + fgetcsv().

E.g. with specific choice of delimiter, enclosure and escape character:

Send ['"@@","B"'] to fputcsv() as a row of data.
fgetcsv() gives you back ['"@@', 'B"""'].
https://3v4l.org/3ZUO8

The reason is that fputcsv() writes '"""@@",""B"""' to the file, instead of '"""@@"",""B"""'.

In fact, writing the latter explicitly with fwrite() instead of fputcsv() fixes the problem:
https://3v4l.org/mjQRi


See https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv
Especially this answer, https://stackoverflow.com/a/44441433/246724



Test script:
---------------
$delimiter = ',';
$enclosure = '"';
$escape_char = "@";

$row_before = ["{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}{$enclosure}B{$enclosure}"];

print "\nBEFORE:\n";
var_export($row_before);
print "\n";

$fh = fopen($file = 'php://temp', 'rb+');


fputcsv($fh,$row_before,$delimiter,$enclosure, $escape_char);

# fwrite($fh, '"""@@"",""B"""');

rewind($fh);

$row_plain = fread($fh, 1000);

print "\nPLAIN:\n";
var_export($row_plain);
print "\n";

rewind($fh);

$row_after = fgetcsv($fh, 500,$delimiter,$enclosure, $escape_char);

print "\nAFTER:\n";
var_export($row_after);
print "\n\n";

fclose($fh);

Expected result:
----------------
BEFORE:
array (
  0 => '"@@","B"',
)

PLAIN:
'"""@@"",""B"""
'

AFTER:
array (
  0 => '"@@","B"',
)

Actual result:
--------------
BEFORE:
array (
  0 => '"@@","B"',
)

PLAIN:
'"""@@",""B"""
'

AFTER:
array (
  0 => '"@@',
  1 => 'B"""',
)

Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2017-06-12 01:27 UTC] danack@php.net
I looked at fixing the behaviour of the CSV functions before......basically, I'm not sure it is fixable in a way that would be acceptable. Any fix around the behaviour you're seeing will result in a BC break for some existing applications.

I'd recommend using a userland CSV parser instead.
 [2017-06-12 02:45 UTC] andreas at dqxtech dot net
> Any fix around the behaviour you're seeing will result in a BC break for some existing applications.

Afaik, CSV implementations in other languages / platforms do not have a special "escape character", and instead just duplicate all quotes (enclosure chars) in the cell text.

fputcsv() does not seem to support this. Someone on stackoverflow suggested to call fputcsv() with $escape_char === $enclosure, so fputcsv($handle, $fields, ',', '"', '"'). But this did not solve the problem.

Maybe we could allow another value for $escape_char that was previously not allowed. E.g. $escape_char === FALSE or TRUE (pick one). So fputcsv($handle, $fields, ',', '"', FALSE).

If $escape_char is FALSE, fputcsv() could run a standards-compliant behavior.

------

> I'd recommend using a userland CSV parser instead.

I actually did end up writing my own userland CSV parser.
It was very very easy with files I encoded myself, that correctly duplicated all quotes. Here my own fgetcsv() only needed to count the quotes in a piece of data, and see if they are even.

However, it is more tricky with CSV coming from 3rd parties, which is often poor quality, and may have rogue quotes that are not properly duplicated.

All I would ask here is for fgetcsv($handle, $length, ',', '"', FALSE) to reliably decode correct CSV files, and have a sane fallback behavior for incorrect CSV files.

So:
- If a cell does NOT begin with a quote, it ends at the next comma or line break. Any quotes within the cell are read as-is, and considered part of the cell content.
- If a cell DOES begin with a quote, but contains non-duplicate quotes surrounded by text, it ends at the next comma or line break. Any quotes within the cell, and the one at the beginning, are read as-is, and considered part of the cell content.
- If a cell DOES begin with a quote, and does not contain rogue quotes, it ends at the first comma or line break after an even number of quotes.

I hope this makes sense.
- If a cell begins with a quote, we look for the next comma or line break after an even number of quotes.
-
 [2017-06-12 02:49 UTC] andreas at dqxtech dot net
> fputcsv() does not seem to support this.

I should clarify:
fputcsv() does duplicate most of the quotes.
But it always treats the $escape_char in a special way, so we can end up with non-duplicate quotes around escape characters.
There does not seem to be a mode in fputcsv() that does not have an $escape_char.

(Btw the documentation for $escape_char should be improved, I don#t really understand what this parameter intends to do)
 [2017-06-12 02:50 UTC] andreas at dqxtech dot net
> I'd recommend using a userland CSV parser instead.

Maybe you can write the same thing on the stackoverflow question :)
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv
 [2017-06-12 07:06 UTC] spam2 at rhsoft dot net
Sorry, but "use a user land parser because someone may rely on buggy behavior hence we keep" is a bad attitude
 [2017-09-21 11:27 UTC] cmb@php.net
> Sorry, but "use a user land parser because someone may rely on buggy behavior
> hence we keep" is a bad attitude

Well, this is not really a bug, but rather related to the escape character,
which is a non-standard extension. Removing the escape character may very well
cause a BC break for applications relying on it.

Anyhow, currently, a quite acceptable workaround is to pass "\0" as $escape
argument to fputcsv(). It would be better, though, if one could pass an empty
string or maybe NULL, and perhaps to make that the default in PHP 8.
 [2017-09-21 16:09 UTC] andreas at dqxtech dot net
> It would be better, though, if one could pass an empty
> string or maybe NULL, and perhaps to make that the default in PHP 8.

Yes. So how could we achieve this? Do we need an RFC for this?
 [2017-09-21 16:48 UTC] cmb@php.net
> So how could we achieve this? Do we need an RFC for this?

I've already sent a respective mail to the internals list[1]. If there'll be no
massive objections, I'm planning to pursue the RFC process.

[1] <http://news.php.net/php.internals/100729>
 [2018-09-10 16:17 UTC] theodorejb at outlook dot com
This is still a serious issue which I've run into, and it even affects popular libraries such as League/CSV. See the discussion on PHP internals here: https://externals.io/message/100729#103145.

Would it be possible to move forward with allowing a blank string to be passed to fputcsv() to fix this? To me the fact that fputcsv() doesn't support a blank string for the escape character should be treated as a bug - an RFC should only be needed if we want to change the default behavior in PHP 8.
 [2018-09-12 22:53 UTC] cmb@php.net
-Status: Open +Status: Duplicate -Assigned To: +Assigned To: cmb
 [2018-09-12 22:53 UTC] cmb@php.net
Well, it seems we agree that we can't fix the current behavior for
BC reasons, and that the best way to move on is to allow an empty
string as escape parameter for fgetcsv(), fputcsv() and friends,
so that the whole escaping stuff is ignored/omitted. Then we're
left with this ticket being a duplicate of request #38301 and
request #51496, respectively.
 
PHP Copyright © 2001-2018 The PHP Group
All rights reserved.
Last updated: Tue Sep 25 15:01:25 2018 UTC