[2009-08-24 22:31 UTC] soapergem at gmail dot com
Description:
------------
When a text file is saved with UTF-8 encoding, many editors (Notepad, for example) write a few bytes at the front called the "Byte Order Mark" (read more about it on Wikipedia). They are supposed to remain hidden and just be used as meta-data to indicate that the file is UTF-8 encoded. Their hex values are EF BB BF, which show up as "ï»¿" when the bytes are interpreted as ISO-8859-1.
The trouble is that when you read in a UTF-8 text file with either fgets or fgetcsv, PHP misinterprets the BOM as literal text and includes it with all the other text.
Reproduce code:
---------------
<?php
if ( $fp = fopen('utf8_text_file.txt', 'r') ) {
echo fgets($fp);
fclose($fp);
}
?>
Expected result:
----------------
Whatever text is saved on the first line of the text file.
Actual result:
--------------
ï»¿Whatever text is saved on the first line of the text file. (The output starts with the BOM bytes EF BB BF.)
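Until PHP handles this itself, the BOM has to be stripped by hand. Here is a minimal sketch of that workaround, assuming PHP 7 or later. The name `strip_utf8_bom` is made up for illustration (it is not a PHP function), and the temporary file just stands in for a real UTF-8 text file.

```php
<?php
// Hypothetical helper, not part of PHP: drop a leading UTF-8 BOM from a string.
function strip_utf8_bom(string $line): string
{
    $bom = "\xEF\xBB\xBF";
    // Compare only the first three bytes against the BOM.
    if (strncmp($line, $bom, 3) === 0) {
        return substr($line, 3);
    }
    return $line;
}

// Demo: write a small file that starts with a BOM, then read it back.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFfirst line\nsecond line\n");

$fp = fopen($path, 'r');
// Only the first line of a file can carry the BOM.
$first = strip_utf8_bom(fgets($fp));
$second = fgets($fp);
fclose($fp);
unlink($path);

echo $first;
```

The same call works on the first field returned by fgetcsv, since the BOM ends up glued to the front of that field.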
But generally speaking this isn't desired behavior from the user standpoint. When you open a file in Notepad that has this character at the front, you never see it. I never knew it was there until trying to read it raw through PHP, since it is clearly not intended to be part of the content, but instead part of the file meta-data.

It would be unwise not to expect that Unicode will eventually become the standard. And currently the burden is on the PHP user to account for it, when I think it should be a language feature. So I suggest doing something like adding a letter to the "mode" of fopen(), for instance something like this:

$fp = fopen('utf8_text_file.txt', 'ru');

The "u" would indicate that the file *may* be encoded in UTF-8, and if so, throw out the BOM at the front. This would mean that fseek'ing to 0 would effectively start just after the BOM (if present), and the file would be initialized to this seek position. This would provide backwards compatibility, since you would have to change the fopen() mode for it to detect the BOM. And it'd make things for PHP users like myself a lot easier. Just a thought.
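The suggested behavior can be roughly approximated in userland today. The following is only a sketch under assumptions: `fopen_skip_utf8_bom` is a hypothetical name, fopen() has no 'u' mode, and the temporary CSV file exists purely for the demo.

```php
<?php
// Hypothetical userland stand-in for the proposed 'u' mode (not a PHP API).
// Opens a file for reading and leaves the pointer just past a UTF-8 BOM,
// or at offset 0 when no BOM is present.
function fopen_skip_utf8_bom(string $path)
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        return false;
    }
    // Peek at the first three bytes; go back to the start if they are not the BOM.
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        rewind($fp);
    }
    return $fp;
}

// Demo: a CSV file saved with a BOM in front of the header row.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFname,city\nAda,London\n");

$fp = fopen_skip_utf8_bom($path);
// All fgetcsv arguments are passed explicitly to avoid relying on defaults.
$header = fgetcsv($fp, 0, ',', '"', '\\');
$row = fgetcsv($fp, 0, ',', '"', '\\');
fclose($fp);
unlink($path);

var_dump($header[0] === 'name');
```

One difference from the proposal: with this helper, fseek($fp, 0) still lands on the BOM itself, so code that rewinds the handle would have to seek to offset 3 instead. Moving that logic into fopen() is what would make the BOM fully invisible.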