[2009-06-09 14:18 UTC] krynble at yahoo dot com dot br
Description:
------------
Problem with fgetcsv ignoring special characters at the beginning of a string. The example I had was the word "?TICA", using the "#" character as the separator.

Reproduce code:
---------------
Consider a file with the following contents:

WEIRD#?TICA#BEHAVIOR

When using fgetcsv to parse this file, I get an output like this:

Array ( [0] => WEIRD, [1] => TICA, [2] => BEHAVIOR )

Expected result:
----------------
Array ( [0] => WEIRD, [1] => ?TICA, [2] => BEHAVIOR )

Actual result:
--------------
Array ( [0] => WEIRD, [1] => TICA, [2] => BEHAVIOR )
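For context, fgetcsv() is locale-sensitive (internally it calls php_mblen(), which consults LC_CTYPE), so a commonly reported workaround is to select a UTF-8 locale before parsing. Below is a minimal sketch of the reported case; "ÉTICA" is a stand-in assumption for the garbled "?TICA" above, and the locale names are examples that vary per system:

```php
<?php
// Sketch of the reported case. "ÉTICA" stands in for the garbled
// "?TICA" (assumption: the first character is a high-bit, non-ASCII
// character). fgetcsv() consults the LC_CTYPE locale via php_mblen(),
// so selecting a UTF-8 locale first is a commonly reported workaround.
setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8', 'pt_BR.UTF-8');

// Build the sample line in an in-memory stream instead of a file.
$fp = fopen('php://memory', 'r+');
fwrite($fp, "WEIRD#ÉTICA#BEHAVIOR\n");
rewind($fp);

$row = fgetcsv($fp, 0, '#');  // '#' as the field separator
fclose($fp);

print_r($row);  // three fields; the middle one should keep its first char
```

Whether the first character of the second field survives depends on the PHP build and the active locale, which is exactly the behavior this report is about.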
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Sun Nov 02 12:00:01 2025 UTC
Below you'll find a small script which shows how to implement a user filter that can be used to utf8-encode the data on the fly, so that fgetcsv is happy and returns correct output even if the first character in a field has its high bit set and is not valid UTF-8.

Remember: this is a workaround and impacts performance. It is not a valid fix for the bug. I didn't yet have time to look deeply into the C implementation of fgetcsv, but all these calls to php_mblen() feel suspicious to me. I'll try to have a look into this later today, but for now I'm just glad I have this workaround (quickly hacked together; keep that in mind):

```php
<?php
class utf8encode_filter extends php_user_filter {
    function is_utf8($string) {
        return preg_match('%(?:
             [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
            |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
            |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
            |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
    }

    function filter($in, $out, &$consumed, $closing) {
        while ($bucket = stream_bucket_make_writeable($in)) {
            if (!$this->is_utf8($bucket->data))
                $bucket->data = utf8_encode($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

/* Register our filter with PHP */
stream_filter_register("utf8encode", "utf8encode_filter")
    or die("Failed to register filter");

$fp = fopen($_SERVER['argv'][1], "r");

/* Attach the registered filter to the stream just opened */
stream_filter_prepend($fp, "utf8encode");

while ($data = fgetcsv($fp, 0, ';', '"'))
    print_r($data);
fclose($fp);
```

The problem also appears if the special char is preceded by a blank; the blank disappears as well. I use this ugly workaround:

1. First read the complete CSV file into a variable: $import
2. $import = preg_replace ("{(^|\t)([€-ÿ ])}m", "$1~~$2", $import);
3. After fgetcsv, for each $field of the row array: $field = str_replace ('~~', '', $field);

This means: before using fgetcsv, insert a magic sequence (e.g. ~~) at the beginning of every field which begins with a blank or a special char; after parsing with fgetcsv, remove it from each field.

Max.

Confirmed with php5 (5.3.6-13ubuntu3.2 on Oneiric Ocelot); can be worked around by quoting the value with quotation marks. For example, the line

a,"a",é,"é",óú,"óú",ó&ú,"ó&ú"

yields

array (
  0 => 'a',
  1 => 'a',
  2 => '',
  3 => 'é',
  4 => '',
  5 => 'óú',
  6 => '&ú',
  7 => 'ó&ú',
)

Note the corruption in elements 2, 4, and 6, but not in their quoted counterparts 3, 5, and 7.

eswald@middil, I am not able to reproduce your results with either en_US.UTF-8 or C with a UTF-8 input file:

~> echo $LANG
en_US.UTF-8
~> file utf8.txt
utf8.txt: UTF-8 Unicode text
~> cat utf8.txt
a,"a",é,"é",óú,"óú",ó&ú,"ó&ú"
~> php -r "print_r(fgetcsv(fopen('./utf8.txt','r')));"
Array
(
    [0] => a
    [1] => a
    [2] => é
    [3] => é
    [4] => óú
    [5] => óú
    [6] => ó&ú
    [7] => ó&ú
)

I don't see any corruption. I can understand problems with charsets that are not low-ASCII compatible with a low-ASCII delimiter, but I don't see why this UTF-8 case would break.
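The three-step magic-marker workaround described earlier in this thread can be sketched end to end. Assumptions in this sketch: tab-separated data (matching the preg_replace in step 2), inline sample input instead of a real file, and the byte range \x80-\xFF written in place of the original €-ÿ character class:

```php
<?php
// End-to-end sketch of the magic-marker workaround described earlier.
// Assumptions: tab-separated data (matching the original preg_replace)
// and inline sample input instead of reading a file; the byte range
// \x80-\xFF stands in for the original €-ÿ character class.
$import = "WEIRD\t ÉTICA\tBEHAVIOR\n";

// Step 2: prepend "~~" to any field that starts with a blank or a
// high-bit byte, so fgetcsv() never sees one in first position.
$import = preg_replace('{(^|\t)([\x80-\xFF ])}m', '$1~~$2', $import);

// Parse the patched data from an in-memory stream.
$fp = fopen('php://memory', 'r+');
fwrite($fp, $import);
rewind($fp);

$rows = [];
while (($row = fgetcsv($fp, 0, "\t")) !== false) {
    // Step 3: strip the marker again (str_replace() accepts arrays).
    $rows[] = str_replace('~~', '', $row);
}
fclose($fp);

print_r($rows);
```

The marker only has to survive until fgetcsv has consumed the first character of the field, which is why any low-ASCII sequence unlikely to occur in the data (here "~~") works.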