Request #22108 php doesn't ignore the utf-8 BOM
Submitted: 2003-02-07 08:46 UTC Modified: 2005-08-22 18:35 UTC
Votes:515
Avg. Score:4.7 ± 0.7
Reproduced:470 of 476 (98.7%)
Same Version:324 (68.9%)
Same OS:307 (65.3%)
From: bugzilla at jellycan dot com Assigned: moriyoshi (profile)
Status: Wont fix Package: Feature/Change Request
PHP Version: * OS: *
Private report: No CVE-ID: None
 [2003-02-07 08:46 UTC] bugzilla at jellycan dot com
Problem:
When a PHP file is saved in UTF-8 format with the UTF-8 BOM as its first three bytes (EF BB BF), PHP doesn't ignore these bytes when loading and compiling the file, but instead treats them as output that precedes the <?php tag. This corrupts the page display and makes any subsequent HTTP header output fail.

It does this even when the internal character format is set in php.ini to be utf-8. 

Desired outcome:
PHP recognizes the utf-8 bom and disregards it.
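
A minimal sketch of how the condition described above can be checked from PHP itself, assuming the file is readable from disk. The helper name `fileStartsWithUtf8Bom` is hypothetical and not part of PHP:

```php
<?php
/**
 * Hypothetical helper: report whether a file begins with the UTF-8 BOM
 * (bytes EF BB BF). Only the first three bytes of the file are read.
 */
function fileStartsWithUtf8Bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false; // Unreadable file: treated as "no BOM" in this sketch.
    }

    $prefix = fread($handle, 3);
    fclose($handle);

    return $prefix === "\xEF\xBB\xBF";
}
```

Any bytes in front of the opening tag, including these three, are sent as ordinary output, which is why a later header() call fails.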

 [2003-02-07 23:13 UTC] bugzilla at jellycan dot com
The BOM (byte order mark) is a few bytes at the very front of a file that act as a signature denoting which encoding has been used; in UTF-16/32 it also marks the byte order (LE or BE). Although UTF-8 is byte-order independent, it has become popular on Windows (perhaps less so on Unix) to use the BOM encoded in UTF-8 to flag the file as being in UTF-8 format. This allows editors to determine the type of the file from the first few bytes instead of trying to guess it. Ref: Textpad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25
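
As an illustration of how an editor or parser might use that signature, here is a small sketch that maps the leading bytes to an encoding name. The function name `detectBomEncoding` is an assumption made for this example; the byte sequences are the standard encodings of U+FEFF:

```php
<?php
/**
 * Hypothetical helper: guess an encoding from a leading byte order mark.
 * Returns null when no known BOM is present.
 */
function detectBomEncoding(string $bytes): ?string
{
    // UTF-32 signatures are checked first because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    $signatures = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];

    foreach ($signatures as $encoding => $bom) {
        // strncmp() is binary safe, so NUL bytes in the signature are fine.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    return null;
}
```

Note that this is only a hint: a UTF-16LE stream whose first real character is U+0000 would be reported as UTF-32LE by this sketch.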

The use of this should be obvious: you have to leave the my-language-only mindset that afflicts too many programmers (myself included before this job) and think about the growing multiplicity of languages on the web. I am writing web applications in Japan, with European language and CJK (Chinese/Japanese/Korean) language processing and interfaces. Thus I have php files where variable values are strings in all sorts of languages, hence the utf-8 encoding.

I feel that this is definitely a bug in php. Considering that:
* php is slowly growing into a language-neutral (i18n/l10n possible) language
* php is designed such that php commands can be liberally sprinkled through html, and html is increasingly encoded in utf-8 these days
* the utf-8 bom is becoming increasingly popular as a way of identifying the file character format
* if the utf-8 bom exists php actually outputs it incorrectly and in doing so prevents header output

I request that you don't see this as a feature request, but as a bug in the handling of utf-8 files. Whether the output generator is the correct characterization of this bug or not I leave up to you.

Regards,
Brodie.
 [2003-10-31 11:12 UTC] fujimoto@php.net
I added i18n support to Zend Engine 2 (though it is still only partial), and one of its features is BOM awareness. So you can now gracefully parse scripts with a BOM if you use PHP 5.0.0b2 and configure it with the option '--enable-zend-multibyte'.

These features are still experimental and under testing, so I have not documented them yet, but I'll add entries to the manual, ZEND_CHANGES and so on once I feel certain of the stability and robustness of my patch, though I do not know when that will be :)

Anyway, I'll close this bug if '--enable-zend-multibyte' option in PHP 5.0.0b2 is assured to work well for this problem. Comments are welcome.
 [2003-11-08 06:45 UTC] a9c83cd8bb41db324db5b449352f183 at arcor dot de
I think the best would be that PHP recognizes the BOM and outputs it before it outputs the document (but after the HTTP headers, of course) so that the document can still be recognized as UTF-8 when it's saved to disk (where no Content-Type headers with a charset specification are available).
 [2003-11-09 16:12 UTC] a9c83cd8bb41db324db5b449352f183 at arcor dot de
Thought about it... Now I think it's better when the BOM isn't part of the output because that would cause trouble if you want to output images or PDF or something like that...
 [2004-05-25 12:33 UTC] lapo at lapo dot it
Adding '--enable-zend-multibyte' to latest PHP5 port for FreeBSD for sure solves the problem:

All files contain:
<?
header("Content-Language: it");
echo "??????\n";
?>

cyberx [~] $ php /usr/tmp/utf8-bom.php 
? èéìòù
cyberx [~] $ php /usr/tmp/utf8Y-bom.php 
??????
cyberx [~] $ php /usr/tmp/utf16-bom.php 
??????
cyberx [~] $ php /usr/tmp/utf16BE-bom.php 
??????
cyberx [~] $ php /usr/tmp/utf16LE-bom.php 
??????

Except for "UTF8 without BOM", which is, of course, not distinguishable from ISO8859-15 (the default here), all the other formats are correctly interpreted and output.
(Notice that the 'header' instruction before the 'echo' one would fail on a non-BOM-aware PHP build.)

I wonder if and when this great multibyte support will be available by default in Win32 builds; I would really use it for work and am not willing to buy Visual C just to compile it ;-)
(Though I'm trying to compile it with cygwin's gcc using the '-mno-cygwin' option, we'll see...)
 [2005-01-06 21:08 UTC] techtonik@php.net
How about making this --enable-zend-multibyte default option?
Is it possible to port this support for windows too?
And for 4.3.x branch?
Should it be marked open again?

 [2005-01-12 18:36 UTC] lapo at lapo dot it
> How about making this --enable-zend-multibyte default option?

It is already available on Windows. In fact, I have been using it on a production server since June 2003, with no problems and to my great satisfaction.

Is there any reason this is still not enabled by default?
Is anyone else encountering bugs with it?
 [2005-01-12 23:43 UTC] lapo at lapo dot it
> Is it possible to port this support for windows too?

Of course I quoted the wrong line; zend-multibyte support is POSSIBLE (not DEFAULT) in the Windows version.
 [2005-08-22 18:32 UTC] jwagner at cc dot hut dot fi
PHP 5.0.4 for Windows /still/ does not seem to have it (enable-zend-multibyte) enabled by default. For example, session_start() is broken for UTF-8 encoded PHP files. I would strongly suggest making enable-zend-multibyte a default for the Windows release!
 [2005-08-22 18:35 UTC] derick@php.net
This will come with Unicode support in PHP 6.0
 [2013-07-04 21:50 UTC] mckeever at web dot de
I know this bug entry is old, but why is the BOM problem set to "Wont fix"?
The last comment said the support would come with PHP 6. We all know PHP 6 is dead.
PHP 5.5.0 was released a week ago, but the problem persists.

Since it cannot be guaranteed that none of the included files contain a BOM, PHP should be able to recognize and ignore them. The problem is not only "Headers already sent": namespaces also don't work in files that start with a BOM.
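
Until PHP handles this itself, one possible workaround is to clean the source tree ahead of time. The following is only a sketch under that assumption; `stripBomFromFile` and `stripBomFromPhpFiles` are hypothetical helper names, and the code rewrites files in place, so it should be tried on a copy first:

```php
<?php
/**
 * Hypothetical helper: remove a leading UTF-8 BOM from a file in place.
 * Returns true when the file was rewritten, false otherwise.
 */
function stripBomFromFile(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false || strncmp($contents, "\xEF\xBB\xBF", 3) !== 0) {
        return false;
    }

    return file_put_contents($path, substr($contents, 3)) !== false;
}

/**
 * Walk a directory tree and strip the BOM from every .php file found.
 * Returns the list of paths that were modified.
 */
function stripBomFromPhpFiles(string $directory): array
{
    $modified = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if ($file->isFile() && $file->getExtension() === 'php'
            && stripBomFromFile($file->getPathname())) {
            $modified[] = $file->getPathname();
        }
    }

    return $modified;
}
```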
 [2015-09-25 22:50 UTC] vittorio dot zamparella at gmail dot com
BUMP!
Come on! It's an almost 13-year-old bug; it deserves a good party!

I'm just about to overhaul a piece of PHP software and it's very annoying that I'll have to use only BOM-less UTF-8 files. It doesn't make PHP look very professional, especially considering that PHP has output UTF by default since forever.

PHP's malfunctions with BOMs are subtle and very hard to detect.
On the other hand, BOMs are very useful for signing UTF-8 files, for protecting them against casual Notepad edits, for allowing legacy 8-bit encoded parts of a site or program, and for many other applications.

Output buffering being "on" by default has masked the "headers already sent" problem, but that is masking, not a solution: PHP still outputs a BOM when it is not requested (e.g. in front of an image), and a simple ob_flush() reveals it.
The UTF encoding was smart enough to put the BOM at a code point that means "zero width no-break space", so that mid-page BOMs don't disrupt browsers much. But again, the problem still exists: http://www.w3.org/International/questions/examples/phpbomtest.php
a character that's not supposed to be, a line that shouldn't be.
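
For the output buffering point above, a userland stopgap could be an output callback that drops the BOM from the start of the buffered output. This is a sketch, not a fix: the name `stripLeadingBom` is hypothetical, it only handles a BOM at the very beginning of the output, and it assumes the script that calls ob_start() is itself BOM-free:

```php
<?php
/**
 * Hypothetical output callback: remove a UTF-8 BOM from the first chunk
 * of buffered output. Later chunks are passed through unchanged.
 */
function stripLeadingBom(string $buffer, int $phase): string
{
    // PHP_OUTPUT_HANDLER_START is set on the first call of the handler.
    $isFirstChunk = ($phase & PHP_OUTPUT_HANDLER_START) !== 0;

    if ($isFirstChunk && strncmp($buffer, "\xEF\xBB\xBF", 3) === 0) {
        return substr($buffer, 3);
    }

    return $buffer;
}

// Usage, before any include that might emit a BOM:
// ob_start('stripLeadingBom');
```

A BOM emitted by a file included later in the request ends up mid-stream and is not touched by this callback.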

I don't understand the won't fix: at least half of the solution is clear to me:
include() and require() should simply strip the BOM, that's for sure.
As for the initial BOM: if present, I would output it after the headers when the output encoding is UTF, i.e. when the MIME type belongs to the text family (HTML, CSS and so on); for, say, image/jpg and application/whatever, PHP should strip the BOM instead.

It's a simple algorithm (a bunch of ifs) and lightweight (it just examines the first three bytes of the buffer). It may not cover ALL possible situations, but it's not dangerous and doesn't risk making things worse.
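
The "bunch of ifs" could look roughly like the sketch below. It is one reading of the proposal, not anything PHP implements: the function name `applyBomPolicy` is hypothetical, and deciding on the Content-Type string alone is a simplifying assumption:

```php
<?php
/**
 * Hypothetical sketch of the proposed rule: keep a leading BOM only for
 * text responses declared as UTF-8, and drop it for everything else.
 * $contentType is a header value such as "text/html; charset=utf-8".
 */
function applyBomPolicy(string $body, string $contentType): string
{
    if (strncmp($body, "\xEF\xBB\xBF", 3) !== 0) {
        return $body; // No BOM: nothing to decide.
    }

    $type = strtolower($contentType);
    $isText = strncmp($type, 'text/', 5) === 0;
    $isUtf8 = strpos($type, 'charset=utf-8') !== false;

    // UTF-8 text keeps its signature; images, PDFs and the like lose it.
    return ($isText && $isUtf8) ? $body : substr($body, 3);
}
```

The even simpler "strip it always" variant mentioned in this comment corresponds to returning `substr($body, 3)` unconditionally whenever the BOM is present.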

Want an even simpler algorithm? strip_bom=true option and strip it always!
Browsers will never receive the BOM and will have to manage with HTTP headers and HTML meta tags (as has been the case for years).

That would even be an acceptable bridge to a smarter solution: webmasters could enable the BOM stripping when the project needs it.

Please, remove the won't fix. Give this bug a chance to get fixed!
 [2016-08-09 21:25 UTC] marc dot fauser+php at fauser dot ag
I was trying to work out why my site stopped working after I tried to add
declare(strict_types=1);
The error log told me that it must be the first command in the php file. It was. After an hour I figured out that the UTF8 BOM caused the issue.
 [2016-08-30 09:44 UTC] manuel dot schmitz at liebherr dot com
In my view, the BOM is meta information for the PHP parser and not part of the content. It should not be forwarded to the output.

Please provide a reason for the "Wont fix" status. I am sure that there is a good reason, but I would like to know about it.
 [2017-10-27 11:48 UTC] claudiu dot f dot marginean at gmail dot com
I had problems with a BOM in a CSV import script.
The first column read by the script picked up this character: either the first column's mapping key or its value had the BOM character in it. Because of this, the first column's data was not saved to the database. I spent 2-3 hours investigating it. The strange part was that even Xdebug does not show the BOM character, but the character is there and influences the execution of the script.

I suggest stripping it in fgets() and fgetcsv() as well, so that the file is loaded correctly.
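
Since fgetcsv() does not do this itself, the BOM can be skipped by hand before parsing. A minimal sketch, assuming a seekable stream positioned at its start; the helper name `readCsvWithoutBom` is hypothetical:

```php
<?php
/**
 * Hypothetical helper: read all rows of a CSV stream, skipping a leading
 * UTF-8 BOM so it does not end up glued to the first field.
 * Assumes $handle is a seekable stream positioned at its beginning.
 */
function readCsvWithoutBom($handle): array
{
    // Consume the first three bytes; rewind if they are not a BOM.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle);
    }

    $rows = [];
    // Delimiter, enclosure and escape are passed explicitly.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }

    return $rows;
}
```

Without the check, the first header cell would be the three BOM bytes followed by the column name, which looks identical in most debuggers but fails a string comparison against the plain name.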
 [2024-02-25 18:22 UTC] holdoffhunger at gmail dot com
I must politely disagree with the status of "wont fix".  Using a BOM to indicate UTF-8 is relatively standard practice.

According to RFC3629: adding BOM to a file is "...to prepend a U+FEFF character to a stream of UCS characters as a 'signature'.  A receiver of such a serialized stream may then use the initial character as a hint that the stream consists of UCS characters and also to recognize which UCS encoding is involved..."

It's quite clear that the purpose of the BOM is not to be interpreted as a character in the data, but, as clearly stated, it's a "signature" of UTF-8 characters and a "hint" about the encoding of the characters.

Is it possible to make this an option that can be enabled in php.ini? Cheers.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Thu Nov 21 13:01:29 2024 UTC