Bug #49350 fgets reads the UTF-8 Byte Order Mark literally
Submitted: 2009-08-24 22:31 UTC Modified: 2009-08-25 16:12 UTC
From: soapergem at gmail dot com Assigned:
Status: Not a bug Package: Filesystem function related
PHP Version: 5.3.0 OS: Windows XP
Private report: No CVE-ID: None
 [2009-08-24 22:31 UTC] soapergem at gmail dot com
Description:
------------
When text files are saved with UTF-8 encoding, some editors save a few bytes at the front called the "Byte Order Mark" (read more about it on Wikipedia). They are supposed to remain hidden and just be used as meta-data to indicate that the file is saved with UTF-8 formatting. Their hex values are EF BB BF, which display as "ï»¿" when the bytes are interpreted in a single-byte encoding such as ISO-8859-1.

The trouble is that when you read in a UTF-8 text file with either fgets or fgetcsv, PHP misinterprets the BOM as literal text and includes it with all the other text.

Reproduce code:
---------------
<?php

if ( $fp = fopen('utf8_text_file.txt', 'r') ) {

    echo fgets($fp);
    fclose($fp);

}

?>

Expected result:
----------------
Whatever text is saved on the first line of the text file.

Actual result:
--------------
ï»¿Whatever text is saved on the first line of the text file.
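
For readers who want to confirm the behavior and work around it in userland, here is a minimal sketch. The helper name `strip_utf8_bom` and the sample file contents are assumptions made for illustration; neither is part of PHP or of the original report. It only uses `strncmp`, `substr`, `bin2hex` and the usual file functions.

```php
<?php
// Hypothetical helper (not a PHP built-in): remove a leading UTF-8 BOM, if any.
function strip_utf8_bom($text)
{
    $bom = "\xEF\xBB\xBF"; // the three BOM bytes named in the report
    if (strncmp($text, $bom, 3) === 0) {
        // The cast guards against substr() returning false on older PHP versions.
        return (string) substr($text, 3);
    }
    return $text;
}

// Build a small sample file that starts with a BOM (illustrative content only).
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBF" . "first line\nsecond line\n");

if ($fp = fopen($path, 'r')) {
    $raw = fgets($fp);
    fclose($fp);

    echo bin2hex(substr($raw, 0, 3)), "\n"; // hex of the first three bytes read
    echo strip_utf8_bom($raw);              // the same line without the BOM
}
unlink($path);
```

Only the first line read from a file needs this treatment, since the BOM can only appear at the very start.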

 [2009-08-25 06:44 UTC] jani@php.net
Of course it does. If it didn't, it would be broken.
 [2009-08-25 16:12 UTC] soapergem at gmail dot com
But generally speaking this isn't desired behavior from the user's standpoint. When you open a file in Notepad that has this character at the front, you never see it. I never knew it was there until I tried to read the file raw through PHP. It is clearly not intended to be part of the content, but is instead part of the file meta-data.

It would be unwise not to expect that Unicode will eventually become the standard. And currently the burden is on the PHP user to account for it, when I think it should be a language feature. So I suggest doing something like adding a letter to the "mode" of fopen(), for instance something like this:

$fp = fopen('utf8_text_file.txt', 'ru');

The "u" would indicate that the file *may* be encoded in UTF-8, and if so, throw out the BOM at the front. This would mean that fseek'ing to 0 would effectively start just after the BOM (if present), and the file would be initialized to this seek position. This would preserve backwards compatibility, since you would have to change the fopen() mode for it to detect the BOM. And it'd make things a lot easier for PHP users like myself.

Just a thought.
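
Until something like that exists, the suggested behavior can be approximated in userland. The sketch below is only an illustration: `fopen_skip_utf8_bom` and `rewind_past_utf8_bom` are hypothetical names, fopen() has no "u" mode, and the sample file contents are made up.

```php
<?php
// Hypothetical stand-in for the suggested "u" mode: open a file for reading
// and leave the file pointer just after a UTF-8 BOM, if one is present.
function fopen_skip_utf8_bom($path)
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return false;
    }
    // Peek at the first three bytes; go back to the start if they are not the BOM.
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        rewind($fp);
    }
    return $fp;
}

// Hypothetical counterpart to "fseek'ing to 0": return to the first content byte.
function rewind_past_utf8_bom($fp)
{
    rewind($fp);
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        rewind($fp);
    }
}

// Illustrative usage with a temporary file that starts with a BOM.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBF" . "name,city\nAda,London\n");

$fp = fopen_skip_utf8_bom($path);
$first = fgets($fp);        // first line, read from just after the BOM
rewind_past_utf8_bom($fp);
$again = fgets($fp);        // same line again after the BOM-aware rewind
fclose($fp);
unlink($path);
```

The handle returned here is an ordinary stream, so it can be passed to fgets or fgetcsv as usual. Unlike the suggested mode, a plain fseek($fp, 0) on it still lands on the BOM, which is why the sketch needs a separate rewind helper.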
 