Doc Bug #13950 fgetcsv misinterprets fields containing double-quotes
Submitted: 2001-11-06 00:23 UTC Modified: 2001-12-17 16:44 UTC
From: liamr at umich dot edu Assigned:
Status: Closed Package: Documentation problem
PHP Version: 4.0.6 / 4.20cvs OS: Sun (SPARC) Solaris 2.6
Private report: No CVE-ID: None
 [2001-11-06 00:23 UTC] liamr at umich dot edu
I'm trying to write a script to interpret pine-style addressbooks, which are pretty much just tab-separated values. fgetcsv doesn't handle fields with double quotes correctly. If a field starts with a double quote, it strips the quotes and removes any data that follows the closing quote for that field. This code and data will replicate the problem.

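<?php

// Reproduction script: read a tab-separated pine addressbook with fgetcsv.
$row = 1;
$fh = fopen("./addressbook", "r");
if (!$fh) {
    die("cannot open ./addressbook\n");
}
// fgetcsv returns false at end of file.
while (($data = fgetcsv($fh, 1000, "\t")) !== false) {
    $num = count($data);
    // Skip lines with fewer than 3 fields or with any field containing '#', '(' or ')'.
    $match = preg_grep("/(#|\(|\))/", $data);
    if (($num < 3) || (count($match) >= 1)) {
        continue;
    }

    print "\n$num fields in line $row:\n";
    $row++;
    for ($c = 0; $c < $num; $c++) {
        print $data[$c] . "\n";
    }
}
fclose($fh);

?>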

"addressbook" is a tab separated file containing data along the lines of:

liamr   Hoekenga, Liam  liamr@umich.edu
liam    "Liam Hoekenga" liam.hoekenga@umich.edu
lhoek   "Liam H."       "Liam R. Hoekenga" <liamr@umich.edu>
lrh     ME "ME" ME      liamr@umich.edu
lrh1    "ME" ME liamr@umich.edu

produces this output:
3 fields in line 1:
liamr
Hoekenga, Liam
liamr@umich.edu

3 fields in line 2:
liam
Liam Hoekenga
liam.hoekenga@umich.edu

3 fields in line 3:
lhoek
Liam H.
Liam R. Hoekenga

3 fields in line 4:
lrh
ME "ME" ME
liamr@umich.edu

3 fields in line 5:
lrh1
ME
liamr@umich.edu

 [2001-11-08 14:34 UTC] liamr at umich dot edu
I tried it with the copy of PHP in CVS.  It's a problem in that version too.
 [2001-11-08 15:33 UTC] jeroen@php.net
This is exactly what fgetcsv is for (interpreting the double quotes in CSV files). Use explode if you simply want to split on tabs.
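
To illustrate the explode suggestion, here is a minimal sketch. The helper name split_tab_line and the sample line (modeled on the addressbook data in this report) are assumptions made for illustration. With a plain split, double quotes are ordinary characters and nothing after a closing quote is dropped.

```php
<?php
// Hypothetical helper: split one raw line on tabs, treating quotes as plain text.
function split_tab_line(string $line): array
{
    // Strip the trailing line ending that fgets() would leave in place.
    return explode("\t", rtrim($line, "\r\n"));
}

// Sample line modeled on the "lrh1" entry from the report.
$fields = split_tab_line("lrh1\t\"ME\" ME\tliamr@umich.edu\n");
print_r($fields);
// $fields is ['lrh1', '"ME" ME', 'liamr@umich.edu']
```

In a real script the line would come from fgets($fh) inside the read loop instead of a literal string.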

No bug.

--Jeroen
 [2001-11-08 16:22 UTC] liamr at umich dot edu
How is this /not/ a bug?

It does the same thing independent of the delimiter.

If I have a CSV file that looks like
one,"two" three,four
one,two" three",four
one,two,three,four
one,two "three" four,five

and it gets read in as
('one', 'two', 'four')
('one',' two "three"', 'four')
('one', 'two', 'three', 'four')
('one', 'two "three" four', 'five')

That behavior does not match the definition of fgetcsv in the manual. It is mishandling the data.

"fgetcsv --  Gets line from file pointer and parse for CSV fields 

Description

    array fgetcsv (int fp, int length, string [delimiter])
 
Similar to fgets() except that fgetcsv() parses the line it reads for fields in CSV format and returns an array containing the fields read. The field delimiter is a comma, unless you specify another delimiter with the optional third parameter. "

How can that possibly mean "strips off double quotes and removes subsequent data"?  I'd expect it to split on the delimiter specified, and to file the resulting information into the array.

 [2001-11-08 16:32 UTC] jeroen@php.net
The CSV format prescribes that fields may be enclosed in double quotes, to make it possible to have the delimiter itself part of a field.
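
A short sketch of what the enclosure is for, assuming a PHP version that provides str_getcsv() (it does not exist in the 4.x releases this report was filed against). The sample lines are made up for illustration. The escape argument is passed explicitly so the call does not depend on its default.

```php
<?php
// A quoted field may contain the delimiter itself.
$line = 'liamr,"Hoekenga, Liam",liamr@umich.edu';
$fields = str_getcsv($line, ',', '"', '\\');
print_r($fields);
// $fields is ['liamr', 'Hoekenga, Liam', 'liamr@umich.edu']

// Inside a quoted field, a doubled quote stands for one literal quote.
$quoted = str_getcsv('a,"say ""hi""",b', ',', '"', '\\');
print_r($quoted);
// $quoted is ['a', 'say "hi"', 'b']
```

Without the quotes, the comma in "Hoekenga, Liam" would split the name into two fields.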

Try searching Google for CSV to find a definition. If you find one, we'd appreciate it if you entered a link here, so we can make the documentation better.

Changing to documentation bug (indeed, the docs aren't clear about this).
 [2001-11-08 16:32 UTC] derick@php.net
What output did you expect then from your example?

Derick
 [2001-11-08 17:01 UTC] jeroen@php.net
"It's important to note that while just about everyone thinks they know what the CSV file format is, there is actually no formal definition of the format and there are many subtle differences. Here's one description of a CSV file:

  http://www.whatis.com/csvfile.htm"

Source: http://dbi.symbolstone.org/cgi/summarydump?module=DBD::CSV

That whatis link is broken; see http://whatis.techtarget.com/definition/0,,sid9_gci213871,00.html

On several sites I encountered the same (incomplete and vague) BNF definition, original source unknown:

<CSV_file> ::= { <CSV_line> }
<CSV_line> ::= <value> { "," <value> } <spaces_and_tabs> <CRLF>
<value> ::= <spaces_and_tabs>
        (
          { <any_text_except_quotas_and_commas_and_smth_else> }
        | <single_or_double_quote>
          <any_text_save_CRLF_with_corresponding_doubled_quotas>
          <the_same_quote>
        )

That's all I think... and there is some problem with this format:
different database systems have different definitions of the
term <any_text_except_quotas_and_commas_and_smth_else> :)

[Found, amongst others, on http://myfileformats.com/search.php?name=CSV]
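
One possible reading of the grammar above, sketched for a single comma-delimited line. The function name parse_csv_line_sketch is hypothetical, and two choices are assumptions that the grammar does not settle: an unquoted value simply runs to the next comma (so it may contain quote characters), and text following a closing quote is rejected instead of being guessed at. This is not how fgetcsv is implemented.

```php
<?php
// Hypothetical sketch of the BNF above; not the fgetcsv implementation.
function parse_csv_line_sketch(string $line): array
{
    // <spaces_and_tabs> <CRLF> at the end of the line carry no data.
    $line = rtrim($line, " \t\r\n");
    $len = strlen($line);
    $values = [];
    $i = 0;
    while (true) {
        // Skip the leading <spaces_and_tabs> of a <value>.
        while ($i < $len && ($line[$i] === ' ' || $line[$i] === "\t")) {
            $i++;
        }
        $value = '';
        if ($i < $len && ($line[$i] === '"' || $line[$i] === "'")) {
            // Quoted value: runs to <the_same_quote>; a doubled quote is a literal quote.
            $quote = $line[$i++];
            $closed = false;
            while ($i < $len) {
                if ($line[$i] === $quote) {
                    if ($i + 1 < $len && $line[$i + 1] === $quote) {
                        $value .= $quote;
                        $i += 2;
                        continue;
                    }
                    $i++;
                    $closed = true;
                    break;
                }
                $value .= $line[$i++];
            }
            if (!$closed) {
                throw new InvalidArgumentException('unterminated quoted value');
            }
            // Assumption: the grammar has no rule for text after the closing quote.
            if ($i < $len && $line[$i] !== ',') {
                throw new InvalidArgumentException('text after closing quote');
            }
        } else {
            // Unquoted value: assumed to run to the next comma.
            while ($i < $len && $line[$i] !== ',') {
                $value .= $line[$i++];
            }
        }
        $values[] = $value;
        if ($i >= $len) {
            break;
        }
        $i++; // step over the comma
    }
    return $values;
}

print_r(parse_csv_line_sketch('one,"two, three",four'));
// ['one', 'two, three', 'four']
```

Under these assumptions a line such as one,"two" three,four raises an exception, which makes the ambiguity discussed in this report explicit instead of silently dropping data.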

Changed status again because of an accidental cross-update, and assigning to myself.
 [2001-12-17 16:44 UTC] derick@php.net
Closing due to no feedback
 