Bug #48507 fgetcsv() ignoring special characters
Submitted: 2009-06-09 14:18 UTC Modified: 2012-02-13 05:16 UTC
Votes:10
Avg. Score:4.5 ± 0.7
Reproduced:10 of 10 (100.0%)
Same Version:8 (80.0%)
Same OS:8 (80.0%)
From: krynble at yahoo dot com dot br Assigned:
Status: Not a bug Package: Filesystem function related
PHP Version: 5.* OS: Unix
Private report: No CVE-ID: None
 [2009-06-09 14:18 UTC] krynble at yahoo dot com dot br
Description:
------------
Problem using fgetcsv: it ignores special characters at the beginning of a 
string.

The example I had was the word "ÓTICA" with the "#" character as 
separator.

Reproduce code:
---------------
Consider a file with the following contents: WEIRD#ÓTICA#BEHAVIOR

When using fgetcsv to parse this file, I get an output like this:

Array(
   [0] => WEIRD,
   [1] => TICA,
   [2] => BEHAVIOR
)

Expected result:
----------------
Array(
   [0] => WEIRD,
   [1] => ÓTICA,
   [2] => BEHAVIOR
)

Actual result:
--------------
Array(
   [0] => WEIRD,
   [1] => TICA,
   [2] => BEHAVIOR
)

 [2009-06-10 12:47 UTC] jani@php.net
Please try using this CVS snapshot:

  http://snaps.php.net/php5.2-latest.tar.gz
 
For Windows:

  http://windows.php.net/snapshots/


 [2009-06-13 18:10 UTC] krynble at yahoo dot com dot br
Unfortunately I'm unable to test it because the server is running in a 
data center.

If someone could give feedback about it, I would appreciate it.

Still, thanks for the help!
 [2009-06-26 19:35 UTC] sjoerd-php at linuxonly dot nl
Reproduced with PHP 5.2.10, PHP 5.2.11-dev (200906261830) and PHP 5.3RC4. Example code:

<?php
$fp = tmpfile();
$str = "WEIRD#\xD3TICA#BEHAVIOR";
fwrite($fp, $str);
fseek($fp, 0);
$arr = fgetcsv($fp, 100, '#');
var_dump($arr[1]);
fclose($fp);
?>

Expected: string(5) "\xD3TICA"
Actual: string(4) "TICA"
 [2009-09-21 18:07 UTC] dmulryan at calendarwiz dot com
Similar problem when parsing the following line:

0909211132,1,??????,????,CForm,Y,1,1,1,97.95.176.240,2530

which produces empty array elements for fields with special characters:

Array ( [0] => 0909211132 [1] => 1 [2] => [3] => [4] => URL [5] => Y [6] => 1 [7] => 1 [8] => 1 [9] => 97.95.176.240 [10] => 2530 )
 [2009-09-21 18:11 UTC] dmulryan at calendarwiz dot com
Note: Previous comment has error where URL is shown in array element.  This is not a bug but my error in the example.  Bug is in special characters.
 [2009-09-22 14:45 UTC] phofstetter at sensational dot ch
I was looking into this (after having been bitten by it) and I can add another tidbit that might help track this down:

The bug doesn't happen if the file fgetcsv() is reading is UTF-8 encoded.

I created a test file in ISO-8859-1 and then ran its contents through utf8_encode() (file_get_contents(), then utf8_encode(), then file_put_contents()) to create the UTF-8 version of it (explaining this here because I'm not sure whether this would write a BOM or not; probably not, though).

That version could be read correctly.

I'm now writing a stream filter that does the UTF-8 conversion on the fly, to hook in between the file and fgetcsv(). I will lose a bit of performance, but in my case this is the cleanest workaround.
 [2009-09-22 15:09 UTC] phofstetter at sensational dot ch
Below you'll find a small script that shows how to implement a user filter that UTF-8-encodes the data on the fly, so that fgetcsv() is happy and returns correct output even if the first character in a field has its high bit set and is not valid UTF-8.

Remember: this is a workaround and it impacts performance. It is not a valid fix for the bug.

I haven't yet had time to look deeply into the C implementation of fgetcsv(), but all those calls to php_mblen() feel suspicious to me.

I'll try to have a look into this later today, but for now I'm just glad I have this workaround (quickly hacked together, so keep that in mind):

<?php

class utf8encode_filter extends php_user_filter {
  // Returns 1 if $string contains at least one valid multi-byte UTF-8 sequence, 0 otherwise.
  function is_utf8($string){
      return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]        # non-overlong 2-byte
          |\xE0[\xA0-\xBF][\x80-\xBF]               # excluding overlongs
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}      # straight 3-byte
          |\xED[\x80-\x9F][\x80-\xBF]               # excluding surrogates
          |\xF0[\x90-\xBF][\x80-\xBF]{2}    # planes 1-3
          |[\xF1-\xF3][\x80-\xBF]{3}                  # planes 4-15
          |\xF4[\x80-\x8F][\x80-\xBF]{2}    # plane 16
      )+%xs', $string);
  }
      
  function filter($in, $out, &$consumed, $closing)
  {
    while ($bucket = stream_bucket_make_writeable($in)) {
      if (!$this->is_utf8($bucket->data))
          $bucket->data = utf8_encode($bucket->data);
      $consumed += $bucket->datalen;
      stream_bucket_append($out, $bucket);
    }
    return PSFS_PASS_ON;
  }
}

/* Register our filter with PHP */
stream_filter_register("utf8encode", "utf8encode_filter")
    or die("Failed to register filter");

$fp = fopen($_SERVER['argv'][1], "r");

/* Attach the registered filter to the stream just opened */
stream_filter_prepend($fp, "utf8encode");

while (($data = fgetcsv($fp, 0, ';', '"')) !== false)
    print_r($data);

fclose($fp);
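
As an alternative to a hand-written user filter, the same on-the-fly conversion can probably be done with PHP's bundled `convert.iconv.*` stream filter. The sketch below is not from the original comment: it assumes the iconv extension is enabled and that the input really is ISO-8859-1, and the helper name `open_latin1_csv_as_utf8` is made up for illustration.

```php
<?php
// Hypothetical helper: open a file assumed to be ISO-8859-1 and read it as UTF-8.
function open_latin1_csv_as_utf8(string $path)
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    // convert.iconv.<from>/<to> is a built-in filter provided by the iconv extension.
    $filter = stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
    if ($filter === false) {
        fclose($fp);
        throw new RuntimeException('iconv stream filter is not available');
    }
    return $fp;
}

// Usage sketch (the file name is a placeholder):
// $fp = open_latin1_csv_as_utf8('data.csv');
// while (($row = fgetcsv($fp, 0, ';', '"', '\\')) !== false) { print_r($row); }
// fclose($fp);
```

Unlike the filter above, this version does not try to detect input that is already UTF-8, so it should only be attached to files known to be in a one-byte encoding.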
 [2009-12-12 01:33 UTC] jani@php.net
See also bug #50456
 [2009-12-12 11:40 UTC] pahan at hubbitus dot spb dot su
Sorry for the duplicate (#50456 is mine). In addition to the fgetcsv problem described there, that report also suggests fixing fputcsv to allow forcing enclosures around single-word fields.

Of course that does *not* solve this problem of incorrect fgetcsv parsing, because the RFC allows unquoted values ( http://www.faqs.org/rfcs/rfc4180.html , section 2.5 ), but it would make the fputcsv/fgetcsv pair at least compatible with each other within PHP.
 [2010-05-18 11:03 UTC] mike@php.net
-Status: Verified +Status: Bogus
 [2010-05-18 11:03 UTC] mike@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

Quote from the docs:

Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.
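
Since the note ties the result to the locale of the running process, a quick first diagnostic is to print what PHP currently sees. This is only an illustrative sketch: the helper name `current_ctype_locale` is invented here, and it assumes that the character-type category (LC_CTYPE) is the one that matters for this function.

```php
<?php
// Report the character-type locale PHP is currently using.
// Passing "0" as the locale queries the setting without changing it.
function current_ctype_locale(): string
{
    $locale = setlocale(LC_CTYPE, '0');
    return $locale === false ? '(unknown)' : $locale;
}

// LANG is only the environment's request; LC_CTYPE is what is actually in effect.
printf("LC_CTYPE=%s LANG=%s\n", current_ctype_locale(), getenv('LANG') ?: '(unset)');
```

Running the same lines from the command line and through the web server shows whether the two environments differ.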
 [2010-05-19 13:39 UTC] pahan at hubbitus dot spb dot su
> Quote from the docs:
> Note: Locale setting is taken into account by this function. If LANG is e.g.
> en_US.UTF-8, files in one-byte encoding are read wrong by this function.
OK, a bug documented as "are read wrong by this function" is better than nothing. 
But do you plan to fix this wrong behaviour?
 [2011-02-26 02:36 UTC] gjorgjioski at gmail dot com
This bug also occurs when the file is in UTF-8 (a tab-delimited file using the characters š and č). I can provide an example.
 [2011-02-26 02:46 UTC] gjorgjioski at gmail dot com
Here is a short example:

kategorija	širina platišč	število

read:
kategorija
irina platišč
tevilo

expected:
kategorija
širina platišč
število
 [2011-07-08 08:39 UTC] php-bug-48507 at bsrealm dot net
This IS a bug. Whatever the locale is, I expect this function to read everything between the delimiter characters without stripping any of the contents. Besides, the docs say that files in a one-byte encoding are read wrong, and this is a different case. This bug causes a serious portability issue. In my case, the function was used to read a custom database that stored descriptions entered by users. Some descriptions were in UTF-8 encoding. The function just had to read whatever was between the delimiter characters; it worked like that on Windows hosting and stopped working after moving to Unix hosting. Note that the file itself is not UTF-8 encoded, and it should not be. This is not related to locale. The function must read the data between delimiters, even if it is binary.
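
For data like this, where fields are never quoted, a byte-for-byte split that ignores the locale entirely might look like the sketch below. It is not a replacement for fgetcsv(): it assumes a single-byte delimiter, no enclosures, and no delimiter inside a field, and the function name `split_raw_line` is hypothetical.

```php
<?php
// Split one line on a delimiter, byte for byte, with no locale involvement.
// Assumption: no field is enclosed in quotes and no field contains the delimiter.
function split_raw_line(string $line, string $delimiter = '#'): array
{
    return explode($delimiter, rtrim($line, "\r\n"));
}

$fields = split_raw_line("WEIRD#\xD3TICA#BEHAVIOR\n");
echo bin2hex($fields[1]), "\n"; // prints d354494341: the leading 0xD3 byte is kept
```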
 [2011-07-17 16:19 UTC] max dot wildgrube at web dot de
The problem also appears if the special char is preceded by a blank. This blank also disappears.

I use this ugly workaround:
1. first reading the complete csv file into a variable: $import
2. $import = preg_replace ("{(^|\t)([€-ÿ ])}m", "$1~~$2", $import); 
3. after fgetcsv, for each $field of the row array: $field = str_replace ('~~', '', $field);

This means: before using fgetcsv, insert a magic sequence (e.g. ~~) at the beginning of every field that begins with a blank or a special char; after parsing with fgetcsv, remove it from each field.

Max.
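
The steps above could be wrapped up roughly as follows. This is a sketch under assumptions, not the original code: the function names and the `FIELD_MARKER` constant are invented, the special characters are matched as raw bytes 0x80-0xFF, only a leading marker is stripped (rather than every occurrence), and str_getcsv() stands in for fgetcsv() to keep the example short.

```php
<?php
// Hypothetical marker; it must not occur at the start of any real field.
const FIELD_MARKER = '~~';

// Prefix every field that starts with a space or a byte >= 0x80.
function protect_fields(string $csv, string $delimiter = "\t"): string
{
    $d = preg_quote($delimiter, '/');
    // The lookahead leaves the first byte of the field in place after the marker.
    return preg_replace('/(^|' . $d . ')(?=[\x80-\xFF ])/m', '$1' . FIELD_MARKER, $csv);
}

// Remove the marker again after parsing.
function unprotect_fields(array $row): array
{
    return array_map(
        static fn ($f) => is_string($f) && strncmp($f, FIELD_MARKER, strlen(FIELD_MARKER)) === 0
            ? substr($f, strlen(FIELD_MARKER))
            : $f,
        $row
    );
}

$line = "kategorija\t\xE9t\xE9\t abc";
$row  = unprotect_fields(str_getcsv(protect_fields($line), "\t", '"', '\\'));
print_r(array_map('bin2hex', $row));
```

Because every protected field now starts with a plain ASCII character, the parser never sees a high byte or a blank at the start of a field.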
 [2011-10-10 10:03 UTC] ghosh at q-one dot com
Sorry, I don't understand why this isn't a bug either. Could someone please elaborate? I tried setting all different kinds of locales, to no avail. The first letter of a string starting with a UTF-8 character is always missing. IMHO, fgetcsv should work as a simple string operation (or, whatever weird things it does right now, at least have a parameter to make it do so; count this as a feature request if you wish). I think the current behavior is totally confusing. For instance, I don't understand why only the first character goes missing, while the problem doesn't appear if the character is in the middle of a string.
 [2011-10-18 13:59 UTC] me at monicag dot it
Quoting my fellows above: how come this is not a bug?
 [2011-10-28 08:33 UTC] peter dot e dot lind at gmail dot com
This is definitely still a bug. My locale is set to da_DK.utf8 and the file I'm 
trying to read is in UTF-8 (confirmed with a hex editor, though in fact it does not 
matter: the behaviour is the same for UTF-8 and ISO-8859-1), yet special characters 
are still thrown away when they come first in a field.
 [2012-01-18 11:53 UTC] tero dot tasanen at gmail dot com
I can also confirm that this is an actual bug. File encoding UTF-8, locale 
settings are set correctly and characters like äöå are dropped from the beginning 
of the csv column. 

Tested with php versions 5.2.6, 5.2.10, 5.3.6
 [2012-01-26 19:50 UTC] eswald at middil dot com
Confirmed with php5 (5.3.6-13ubuntu3.2 on Oneiric Ocelot); can be worked around 
by quoting the value with quotation marks.  For example, the line

    a,"a",é,"é",óú,"óú",ó&ú,"ó&ú"

yields

    array (
      0 => 'a',
      1 => 'a',
      2 => '',
      3 => 'é',
      4 => '',
      5 => 'óú',
      6 => '&ú',
      7 => 'ó&ú',
    )

Note the corruption in elements 2, 4, and 6, but not in their quoted 
counterparts 3, 5, and 7.
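
When the file is produced by your own code, the quoting workaround can be applied at write time. fputcsv() only encloses fields when it considers it necessary, so the sketch below builds the line by hand; the function name `csv_line_quote_all` is hypothetical and the doubling of embedded quotes follows the usual CSV convention.

```php
<?php
// Build one CSV line with every field enclosed, doubling embedded enclosure characters.
function csv_line_quote_all(array $fields, string $delimiter = ',', string $enclosure = '"'): string
{
    $quoted = array_map(
        static fn ($field) => $enclosure
            . str_replace($enclosure, $enclosure . $enclosure, (string) $field)
            . $enclosure,
        $fields
    );
    return implode($delimiter, $quoted) . "\n";
}

echo csv_line_quote_all(['a', 'é', 'óú', 'ó&ú']);
```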
 [2012-01-26 19:55 UTC] eswald at middil dot com
Tested with LANG=C, input file encoding of UTF-8.
Also tested with LANG=C, input file encoding of cp1252, with identical results, 
except that the output characters (what was left of them) were also cp1252.
 [2012-02-13 01:46 UTC] figura at hotbox dot ru
setlocale() might solve the issue, but I do not see any reason to make fgetcsv depend on locale settings. The format is straightforward and clear. 

This "feature" is especially confusing when the string being read is in UTF-8.
 [2012-02-13 05:16 UTC] rasmus@php.net
eswald@middil, I am not able to reproduce your results with either en_US.UTF-8 
or C with a UTF-8 input file:

~> echo $LANG
en_US.UTF-8
~> file utf8.txt
utf8.txt: UTF-8 Unicode text
~> cat utf8.txt 
a,"a",é,"é",óú,"óú",ó&ú,"ó&ú"
~> php -r "print_r(fgetcsv(fopen('./utf8.txt','r')));"
Array
(
    [0] => a
    [1] => a
    [2] => é
    [3] => é
    [4] => óú
    [5] => óú
    [6] => ó&ú
    [7] => ó&ú
)

I don't see any corruption. I can understand problems with charsets that are not  
low-ascii compatible with a low-ascii delimiter, but I don't see why this UTF8 
case would break.
 [2012-02-27 16:07 UTC] jamie dot kahgee at gmail dot com
rasmus@php, eswald@middil

I had the same problem. Running fgetcsv from the CLI showed no error, and everything 
worked and output as expected. It was when I ran it through Apache that I 
couldn't get my output to display (same script, same file).

(Ü) was the specific character I was dealing with at the start of a string that 
was not showing.

After I set the locale explicitly in the script, everything worked as 
expected through Apache, and my strings started parsing and displaying correctly.

setlocale(LC_ALL, 'en_US.UTF-8');

Hopefully this can help you.
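
A slightly more defensive variant of that call is sketched below. Whether a given locale name exists depends on the host, so the candidate names are assumptions, as is the helper name `use_utf8_ctype`. The sketch also narrows the change to LC_CTYPE, on the assumption that only character classification matters here, whereas LC_ALL would also alter things like number formatting.

```php
<?php
// Try several UTF-8 locale names; which ones exist depends on the host.
function use_utf8_ctype(array $candidates = ['en_US.UTF-8', 'C.UTF-8', 'en_US.utf8'])
{
    // setlocale() accepts an array and returns the first locale it could set, or false.
    return setlocale(LC_CTYPE, $candidates);
}

$previous = setlocale(LC_CTYPE, '0');   // remember the current setting
$active   = use_utf8_ctype();
if ($active === false) {
    error_log("No UTF-8 locale available; still using $previous");
}

// ... parse the CSV file here ...

if ($previous !== false) {
    setlocale(LC_CTYPE, $previous);     // restore the earlier setting
}
```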
 