PHP :: Bug #49350 :: fgets reads the UTF-8 Byte Order Mark literally

Bug #49350	fgets reads the UTF-8 Byte Order Mark literally
Submitted:	2009-08-24 22:31 UTC	Modified:	2009-08-25 16:12 UTC
From:	soapergem at gmail dot com	Assigned:
Status:	Not a bug	Package:	Filesystem function related
PHP Version:	5.3.0	OS:	Windows XP
Private report:	No	CVE-ID:	None

[2009-08-24 22:31 UTC] soapergem at gmail dot com

Description:
------------
Text files saved with UTF-8 encoding often start with a few bytes called the "Byte Order Mark" (read more about it on Wikipedia). They are supposed to remain hidden and just be used as meta-data to indicate that the file is saved with UTF-8 formatting. Their hex values are EF BB BF, which show up as "ï»¿" when displayed in a single-byte encoding such as Windows-1252.

The trouble is that when you read in a UTF-8 text file with either fgets or fgetcsv, PHP misinterprets the BOM as literal text and includes it with all the other text.

Reproduce code:
---------------
<?php

if ( $fp = fopen('utf8_text_file.txt', 'r') ) {

    echo fgets($fp);
    fclose($fp);

}

?>

Expected result:
----------------
Whatever text is saved on the first line of the text file.

Actual result:
--------------
ï»¿Whatever text is saved on the first line of the text file.
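
The snippet above depends on a file that already exists on disk. As a self-contained sketch (the helper name first_line_hex() and the sample text are illustrative, not part of the original report), the following writes a temporary file that starts with the BOM and prints the bytes fgets() returns as hex:

```php
<?php

// Illustrative helper: writes $contents to a temporary file, reads the
// first line back with fgets(), and returns that line as a hex string.
function first_line_hex($contents)
{
    $path = tempnam(sys_get_temp_dir(), 'bom');
    file_put_contents($path, $contents);

    $fp = fopen($path, 'r');
    $line = fgets($fp);
    fclose($fp);
    unlink($path);

    return bin2hex($line);
}

// "\xEF\xBB\xBF" is the UTF-8 BOM. The hex dump starts with "efbbbf",
// showing that fgets() hands the BOM back as part of the first line.
echo first_line_hex("\xEF\xBB\xBFfirst line\n"), "\n";
```

A hex dump is used here because many terminals and browsers silently hide the BOM when it is echoed as text.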

[2009-08-25 06:44 UTC] jani@php.net

Of course it does. If it didn't, it would be broken.

[2009-08-25 16:12 UTC] soapergem at gmail dot com

But generally speaking this isn't desired behavior from the user standpoint. When you open a file in Notepad that has this character at the front, you never see it. I never knew it was there until trying to read it raw through PHP, since it is clearly not intended to be part of the content, but instead part of the file meta-data.

It would be unwise not to expect that Unicode will eventually become the standard. And currently the burden is on the PHP user to account for it, when I think it should be a language feature. So I suggest doing something like adding a letter to the "mode" of fopen(), for instance something like this:

$fp = fopen('utf8_text_file.txt', 'ru');

The "u" would indicate that the file *may* be encoded in UTF-8, and if so, throw out the BOM at the front. This would mean that fseek'ing to 0 would effectively start just after the BOM (if present), and the file would be initialized to this seek position. This would provide backwards compatibility, since you would have to change the fopen() mode for it to detect the BOM. And it'd make things for PHP users like myself a lot easier.

Just a thought.
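
The "ru" mode suggested above does not exist in PHP. A rough userland approximation, assuming a hypothetical helper named fopen_skip_bom() (not a PHP function), could open the file normally and step past the BOM when it is present:

```php
<?php

// Hypothetical helper, not part of PHP: opens $path for reading and
// leaves the file position just after a leading UTF-8 BOM, if any.
function fopen_skip_bom($path)
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        return false;
    }

    // Peek at the first three bytes; go back to the start if they are
    // not the UTF-8 BOM (EF BB BF).
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        rewind($fp);
    }

    return $fp;
}

// Demonstration with a temporary file that starts with the BOM.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFfirst line\n");

$fp = fopen_skip_bom($path);
echo fgets($fp); // the first line, without the BOM bytes
fclose($fp);
unlink($path);
```

Unlike the proposed mode, this sketch does not change seeking: rewind($fp) or fseek($fp, 0) on the returned handle still moves to byte 0, before the BOM, so callers that seek would have to skip the three bytes again themselves.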
