PHP Bugs  

Bug #22108 php doesn't ignore the utf-8 BOM
Submitted: 7 Feb 2003 8:46am UTC   Modified: 22 Aug 2005 6:35pm UTC
From: bugzilla at jellycan dot com   Assigned to: moriyoshi
Status: Wont fix   Category: Feature/Change Request
Version: *   OS: *
Votes: 327   Avg. Score: 4.8 ± 0.6   Reproduced: 293 of 297 (98.7%)
Same Version: 184 (62.8%)   Same OS: 196 (66.9%)

[7 Feb 2003 8:46am UTC] bugzilla at jellycan dot com
Problem:
When a php file is saved in utf-8 format with the UTF-8 BOM as the first
three bytes of the file (EF BB BF), PHP doesn't ignore these bytes when
loading and compiling the file, but instead considers them output coming
prior to the <?php. This causes incorrect display of the page and
failure of any http header output.

It does this even when the internal character format is set in php.ini
to be utf-8. 

Desired outcome:
PHP recognizes the utf-8 bom and disregards it.
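
A minimal sketch of the check described above, in userland PHP. It only illustrates what "recognize the BOM and disregard it" means at the byte level; it is not how the engine handles it. The names UTF8_BOM, has_utf8_bom() and strip_utf8_bom() are hypothetical helpers, not part of PHP.

```php
<?php
// Hypothetical helpers for illustration; not part of PHP itself.

// The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the byte string starts with the UTF-8 BOM.
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Returns the byte string without a leading UTF-8 BOM, if one is present.
function strip_utf8_bom(string $bytes): string
{
    // The cast guards against substr() returning false on older PHP
    // versions when the input is exactly the three BOM bytes.
    return has_utf8_bom($bytes) ? (string) substr($bytes, 3) : $bytes;
}
```

Applied to the contents of a script file, strip_utf8_bom() would return the source starting at the <?php tag, so nothing would count as output before the first header() call.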
[7 Feb 2003 11:13pm UTC] bugzilla at jellycan dot com
The BOM (byte order mark) is a few bytes at the very front of a file
that act as a signature denoting what type of encoding has been used,
and in UTF16/32 it also marks the byte order (LE or BE). Although utf-8
is byte order independent, it has become popular on windows (perhaps not
so on unix) to make use of the BOM encoded in UTF-8 to flag the file as
being in UTF-8 format. This allows editors to determine the type of the
file from the first few characters instead of trying to guess what type
the file is. Ref: Textpad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25
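
To make the signature idea concrete, here is a rough sketch of how a tool might guess a file's encoding from its first bytes. The function name detect_bom_encoding() is hypothetical. The byte sequences are the standard encodings of U+FEFF in each Unicode encoding form.

```php
<?php
// Hypothetical helper for illustration; not part of PHP itself.
// Returns the encoding implied by a leading BOM, or null if there is none.
function detect_bom_encoding(string $bytes): ?string
{
    // Four-byte signatures are checked first, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    $signatures = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($signatures as $bom => $encoding) {
        // strncmp() is binary safe, so the NUL bytes are compared correctly.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // No BOM: the encoding has to be guessed some other way.
}
```

One known ambiguity of this approach: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE. The sketch accepts that edge case.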

The use of this should be obvious: you have to leave the
my-language-only mindset that afflicts too many programmers (myself
included before this job) and think about the growing multiplicity of
languages on the web. I am writing web applications in Japan, with
European language and CJK (Chinese/Japanese/Korean) language processing
and interfaces. Thus I have php files where variable values are strings
of all sorts of languages - hence utf-8 encoding.

I feel that this is definitely a bug in php. Considering that:
* php is slowly growing into a language-neutral (i18n/l10n possible)
language
* php is designed such that php commands can be liberally sprinkled
through html, and html is increasingly encoded in utf-8 these days
* the utf-8 bom is becoming increasingly popular for reasons of
identifying the file character format
* if the utf-8 bom exists php actually outputs it incorrectly and in
doing so prevents header output

I request that you don't see this as a feature request, but as a bug in
the handling of utf-8 files. Whether the output generator is the correct
characterization of this bug or not I leave up to you.

Regards,
Brodie.
[31 Oct 2003 11:12am UTC] fujimoto@php.net
I added i18n support to Zend Engine 2 (though it is still only
partial...), and one of its features is awareness of the BOM. So now you
can gracefully parse scripts with a BOM if you use PHP 5.0.0b2 and
configure it with the option '--enable-zend-multibyte'.

These features are still experimental and under testing, so I have not
documented them yet, but I'll add entries to the manual, ZEND_CHANGES
and so on once I feel certain of the stability and robustness of my
patch, though I do not know when that will be :)

Anyway, I'll close this bug if the '--enable-zend-multibyte' option in
PHP 5.0.0b2 is confirmed to work well for this problem. Comments are
welcome.
[8 Nov 2003 6:45am UTC] a9c83cd8bb41db324db5b449352f183 at arcor dot de
I think the best would be that PHP recognizes the BOM and outputs it
before it outputs the document (but after the HTTP headers, of course)
so that the document can still be recognized as UTF-8 when it's saved to
disk (where no Content-Type headers with a charset specification are
available).
[9 Nov 2003 4:12pm UTC] a9c83cd8bb41db324db5b449352f183 at arcor dot de
Thought about it... Now I think it's better when the BOM isn't part of
the output because that would cause trouble if you want to output images
or PDF or something like that...
[25 May 2004 12:33pm UTC] lapo at lapo dot it
Adding '--enable-zend-multibyte' to the latest PHP5 port for FreeBSD
definitely solves the problem:

All files contain:
<?
header("Content-Language: it");
echo "àèéìòù\n";
?>

cyberx [~] $ php /usr/tmp/utf8-bom.php 
à èéìòù
cyberx [~] $ php /usr/tmp/utf8Y-bom.php 
àèéìòù
cyberx [~] $ php /usr/tmp/utf16-bom.php 
àèéìòù
cyberx [~] $ php /usr/tmp/utf16BE-bom.php 
àèéìòù
cyberx [~] $ php /usr/tmp/utf16LE-bom.php 
àèéìòù

Except for "UTF8 without BOM", which is, of course, not distinguishable
from ISO8859-15 (the default here), all the other formats are correctly
interpreted and output.
(Notice that the 'header' instruction before the 'echo' one would fail
with a non-BOM-aware PHP build.)

I wonder if and when this great multibyte support will be available by
default in Win32 builds. I would really use it for work and am not
willing to buy Visual C just to compile it ;-)
(though I'm trying to compile it with cygwin's gcc using the
'-mno-cygwin' option, we'll see...)
[6 Jan 2005 9:08pm UTC] techtonik@php.net
How about making this --enable-zend-multibyte default option?
Is it possible to port this support for windows too?
And for 4.3.x branch?
Should it be marked open again?
[12 Jan 2005 6:36pm UTC] lapo at lapo dot it
> How about making this --enable-zend-multibyte default option?

It is already available on Windows. In fact, I have been using it on a
production server since June 2003, with no problems and with much
satisfaction.

Is there any reason this is still not in by default?
Is anyone else encountering bugs with it?
[12 Jan 2005 11:43pm UTC] lapo at lapo dot it
> Is it possible to port this support for windows too?

Of course I quoted the wrong line: zend-multibyte support is POSSIBLE
(not DEFAULT) in the Windows version.
[22 Aug 2005 6:32pm UTC] jwagner at cc dot hut dot fi
PHP 5.0.4 for Windows /still/ does not seem to have it
(enable-zend-multibyte) enabled by default. For example, session_start()
is broken for UTF-8 encoded php files. I would strongly suggest making
enable-zend-multibyte a default for the Windows release!
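
On builds without that option, one possible way to find the affected scripts is to scan a source tree for .php files that begin with the UTF-8 BOM, so they can be re-saved without it. This is only a sketch under that assumption; find_files_with_utf8_bom() is a hypothetical helper, not an existing PHP function.

```php
<?php
// Hypothetical helper for illustration; not part of PHP itself.
// Returns the sorted paths of all .php files under $dir that start
// with the UTF-8 BOM (EF BB BF).
function find_files_with_utf8_bom(string $dir): array
{
    $found = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if (!$file->isFile() || $file->getExtension() !== 'php') {
            continue;
        }
        // Read only the first three bytes of the file.
        $head = file_get_contents($file->getPathname(), false, null, 0, 3);
        if ($head === "\xEF\xBB\xBF") {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}
```

Any file it reports would emit three bytes of output before its first statement runs, which is what stops session_start() and header() from sending headers.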
[22 Aug 2005 6:35pm UTC] derick@php.net
This will come with Unicode support in PHP 6.0


PHP Copyright © 2001-2009 The PHP Group
All rights reserved.
Last updated: Sat Nov 21 10:30:49 2009 UTC