php.net |  support |  documentation |  report a bug |  advanced search |  search howto |  statistics |  random bug |  login
Bug #74713 CSV cell split after fputcsv() + fgetcsv() round trip.
Submitted: 2017-06-08 17:14 UTC Modified: 2017-09-21 16:48 UTC
From: andreas at dqxtech dot net Assigned:
Status: Open Package: Filesystem function related
PHP Version: 7.1.5 OS: Linux / 3v4l
Private report: No CVE-ID: None
Have you experienced this issue?
Rate the importance of this bug to you:

 [2017-06-08 17:14 UTC] andreas at dqxtech dot net
Description:
------------
If one cell of the data sent to fputcsv() contains "{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}", this cell will be split after a round trip of fputcsv() + fgetcsv().

E.g. with specific choice of delimiter, enclosure and escape character:

Send ['"@@","B"'] to fputcsv() as a row of data.
fgetcsv() gives you back ['"@@', 'B"""'].
https://3v4l.org/3ZUO8

The reason is that fputcsv() writes '"""@@",""B"""' to the file, instead of '"""@@"",""B"""'.

In fact, writing the latter explicitly with fwrite() instead of fputcsv() fixes the problem:
https://3v4l.org/mjQRi


See https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv
Especially this answer, https://stackoverflow.com/a/44441433/246724



Test script:
---------------
$delimiter = ',';
$enclosure = '"';
$escape_char = "@";

$row_before = ["{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}{$enclosure}B{$enclosure}"];

print "\nBEFORE:\n";
var_export($row_before);
print "\n";

$fh = fopen($file = 'php://temp', 'rb+');


fputcsv($fh,$row_before,$delimiter,$enclosure, $escape_char);

# fwrite($fh, '"""@@"",""B"""');

rewind($fh);

$row_plain = fread($fh, 1000);

print "\nPLAIN:\n";
var_export($row_plain);
print "\n";

rewind($fh);

$row_after = fgetcsv($fh, 500,$delimiter,$enclosure, $escape_char);

print "\nAFTER:\n";
var_export($row_after);
print "\n\n";

fclose($fh);

Expected result:
----------------
BEFORE:
array (
  0 => '"@@","B"',
)

PLAIN:
'"""@@"",""B"""
'

AFTER:
array (
  0 => '"@@","B"',
)

Actual result:
--------------
BEFORE:
array (
  0 => '"@@","B"',
)

PLAIN:
'"""@@",""B"""
'

AFTER:
array (
  0 => '"@@',
  1 => 'B"""',
)

Patches

Add a Patch

Pull Requests

Add a Pull Request

History

AllCommentsChangesGit/SVN commitsRelated reports
 [2017-06-12 01:27 UTC] danack@php.net
I looked at fixing the behaviour of the CSV functions before......basically, I'm not sure it is fixable in a way that would be acceptable. Any fix around the behaviour you're seeing will result in a BC break for some existing applications.

I'd recommend using a userland CSV parser instead.
 [2017-06-12 02:45 UTC] andreas at dqxtech dot net
> Any fix around the behaviour you're seeing will result in a BC break for some existing applications.

Afaik, CSV implementations in other languages / platforms do not have a special "escape character", and instead just duplicate all quotes (enclosure chars) in the cell text.

fputcsv() does not seem to support this. Someone on stackoverflow suggested to call fputcsv() with $escape_char === $enclosure, so fputcsv($handle, $fields, ',', '"', '"'). But this did not solve the problem.

Maybe we could allow another value for $escape_char that was previously not allowed. E.g. $escape_char === FALSE or TRUE (pick one). So fputcsv($handle, $fields, ',', '"', FALSE).

If $escape_char is FALSE, fputcsv() could run a standards-compliant behavior.

------

> I'd recommend using a userland CSV parser instead.

I actually did end up writing my own userland CSV parser.
It was very very easy with files I encoded myself, that correctly duplicated all quotes. Here my own fgetcsv() only needed to count the quotes in a piece of data, and see if they are even.

However, it is more tricky with CSV coming from 3rd parties, which is often poor quality, and may have rogue quotes that are not properly duplicated.

All I would ask here is for fgetcsv($handle, $length, ',', '"', FALSE) to reliably decode correct CSV files, and have a sane fallback behavior for incorrect CSV files.

So:
- If a cell does NOT begin with a quote, it ends at the next comma or line break. Any quotes within the cell are read as-is, and considered part of the cell content.
- If a cell DOES begin with a quote, but contains non-duplicate quotes surrounded by text, it ends at the next comma or line break. Any quotes within the cell, and the one at the beginning, are read as-is, and considered part of the cell content.
- If a cell DOES begin with a quote, and does not contain rogue quotes, it ends at the first comma or line break after an even number of quotes.

I hope this makes sense.
- If a cell begins with a quote, we look for the next comma or line break after an even number of quotes.
-
 [2017-06-12 02:49 UTC] andreas at dqxtech dot net
> fputcsv() does not seem to support this.

I should clarify:
fputcsv() does duplicate most of the quotes.
But it always treats the $escape_char in a special way, so we can end up with non-duplicate quotes around escape characters.
There does not seem to be a mode in fputcsv() that does not have an $escape_char.

(Btw the documentation for $escape_char should be improved, I don#t really understand what this parameter intends to do)
 [2017-06-12 02:50 UTC] andreas at dqxtech dot net
> I'd recommend using a userland CSV parser instead.

Maybe you can write the same thing on the stackoverflow question :)
https://stackoverflow.com/questions/44427926/data-gets-garbled-when-writing-to-csv-with-fputcsv-fgetcsv
 [2017-06-12 07:06 UTC] spam2 at rhsoft dot net
Sorry, but "use a user land parser because someone may rely on buggy behavior hence we keep" is a bad attitude
 [2017-09-21 11:27 UTC] cmb@php.net
> Sorry, but "use a user land parser because someone may rely on buggy behavior
> hence we keep" is a bad attitude

Well, this is not really a bug, but rather related to the escape character,
which is a non-standard extension. Removing the escape character may very well
cause a BC break for applications relying on it.

Anyhow, currently, a quite acceptable workaround is to pass "\0" as $escape
argument to fputcsv(). It would be better, though, if one could pass an empty
string or maybe NULL, and perhaps to make that the default in PHP 8.
 [2017-09-21 16:09 UTC] andreas at dqxtech dot net
> It would be better, though, if one could pass an empty
> string or maybe NULL, and perhaps to make that the default in PHP 8.

Yes. So how could we achieve this? Do we need an RFC for this?
 [2017-09-21 16:48 UTC] cmb@php.net
> So how could we achieve this? Do we need an RFC for this?

I've already sent a respective mail to the internals list[1]. If there'll be no
massive objections, I'm planning to pursue the RFC process.

[1] <http://news.php.net/php.internals/100729>
 
PHP Copyright © 2001-2018 The PHP Group
All rights reserved.
Last updated: Mon Jul 16 10:01:53 2018 UTC