PHP :: Request #74312 :: Filter BOMs out of PHP output

Submitted: 2017-03-25 21:01 UTC
Modified: 2021-07-26 14:55 UTC
Votes: 5
Avg. Score: 4.4 ± 1.2
Reproduced: 5 of 5 (100.0%)
Same Version: 2 (40.0%)
Same OS: 1 (20.0%)
From: furun at arcor dot de
Assigned:
Status: Re-Opened
Package: Unicode Engine related
PHP Version: 7.0.17
OS: Win7
Private report:
CVE-ID: None


[2017-03-25 21:01 UTC] furun at arcor dot de

Description:
------------
PHP sends the BOM (before the "<?php" tag) to the output. This is a source of some very strange bugs, which are sometimes very difficult to find, because a developer looks for errors in the code itself first and not in a mostly invisible BOM.
(I hunted a "bug" in an image-creating script for a long time before I found the BOM bug in a third-party plugin far away from the "buggy code". The common "Headers already sent" error caused by the BOM is still confusing, but easier to find than a corrupted binary file.)

I would suggest that PHP filter every BOM from PHP output.

(Or is there any reason not to do so that I don't know of?)


Test script:
---------------
(Use any file with BOM)
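
A minimal reproduction sketch, assuming a default configuration (zend.multibyte off). The temporary file and the variable names are illustrative and not part of the original report. It writes a PHP file that begins with the UTF-8 BOM, includes it, and checks whether the three BOM bytes end up in front of the script's own output:

```php
<?php
// The UTF-8 BOM is the byte sequence EF BB BF.
$bom  = "\xEF\xBB\xBF";

// Create a throwaway PHP file that starts with a BOM (illustrative only).
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, $bom . "<?php echo 'hello';");

// Capture everything the included file sends to the output.
ob_start();
include $path;
$output = ob_get_clean();
unlink($path);

// Under the default configuration the BOM is treated as inline text
// in front of "<?php", so it shows up at the start of the output.
var_dump(strncmp($output, $bom, 3) === 0);
echo bin2hex($output), "\n";
```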

[2017-03-25 21:08 UTC] spam2 at rhsoft dot net

This is simply not possible because the BOM is not part of the php output at all - php doesn't and must not care about anything outside of <?php ?>

[2017-03-25 21:29 UTC] furun at arcor dot de

I understand, but should this be a dogma in this case?
In practice, a file has to be opened to process it; in other words, PHP already takes care of opening a source file to check whether it contains <?php code. Couldn't the file handler make an exception when a PHP file is processed?
I cannot think of any practical use for this behavior. If a BOM must be sent to the output, that can be done in code. So this would be a real-life solution.

And depending on the editor software's behavior, having no BOM can cause issues too. So it would be nice if a BOM could be present in a PHP file and be ignored in the output.

I speculate that it is a common and frequently occurring issue, so it may deserve exceptional treatment.

(Thanks for your VERY quick response, this was the fastest ever ;-)
(And sorry if I ask more; I am not a deep PHP-dev insider.)

[2017-03-25 21:48 UTC] nikic@php.net

IIRC it is possible to strip BOMs by using zend.multibyte together with zend.detect_unicode. However, I would strongly recommend against doing this. Instead, you should adjust whatever tooling you use to not insert BOMs for UTF-8 files. BOMs are meaningless for UTF-8 and Unicode recommends against their use.
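
One way to act on the advice to fix the tooling is to find the offending files first. A minimal sketch, assuming the project's sources live under one directory and use the .php extension; has_utf8_bom and find_bom_files are hypothetical helper names, not PHP built-ins:

```php
<?php
// Returns true if the file at $path begins with the UTF-8 BOM (EF BB BF).
function has_utf8_bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false; // unreadable files are simply skipped
    }
    $head = fread($handle, 3);
    fclose($handle);
    return $head === "\xEF\xBB\xBF";
}

// Lists all *.php files below $root that start with a UTF-8 BOM.
function find_bom_files(string $root): array
{
    $found = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile()
            && $file->getExtension() === 'php'
            && has_utf8_bom($file->getPathname())
        ) {
            $found[] = $file->getPathname();
        }
    }
    return $found;
}
```

Calling print_r(find_bom_files(__DIR__)); from the project root would list the candidates, which can then be re-saved without a BOM or reported to the third party that ships them.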

[2017-03-25 22:06 UTC] furun at arcor dot de

So countless developers have to care about this, instead of PHP at one central point...
In my case, the error came from an automatic update of a plugin from a third party.

[2017-03-27 17:51 UTC] cmb@php.net

-Status: Open
+Status: Not a bug
-Assigned To:
+Assigned To: cmb

[2017-03-27 17:51 UTC] cmb@php.net

> So countless developers have to care about this, instead of PHP
> at one central point...

PHP didn't invent the BOM in the first place, and AFAIK a BOM was
never intended for UTF-8 (but rather for UTF-16 where it is
important). Anyhow, as nikic already said: adjust your tooling to
not insert a BOM into UTF-8 encoded documents, and file bug
reports against third-party libraries which ship UTF-8 encoded
files containing a BOM.

As *last* *resort* you can use output_buffering to mitigate BOM
issues.

[1] <http://php.net/manual/en/outcontrol.configuration.php#ini.output-buffering>
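
A sketch of the output-buffering mitigation, under the assumption that a callback is registered with ob_start() at the very start of the request. The output_buffering ini setting on its own only delays output; removing the bytes needs a callback. strip_leading_bom is a hypothetical helper name. It only removes a BOM at the beginning of the buffered output, so a BOM from a file included later, after other output, would still get through:

```php
<?php
// Output handler: removes one leading UTF-8 BOM from the first chunk
// of buffered output and passes everything else through unchanged.
function strip_leading_bom(string $buffer, int $phase): string
{
    if (($phase & PHP_OUTPUT_HANDLER_START)
        && strncmp($buffer, "\xEF\xBB\xBF", 3) === 0
    ) {
        return (string) substr($buffer, 3);
    }
    return $buffer;
}

ob_start('strip_leading_bom');
echo "\xEF\xBB\xBF", "page body"; // stands in for an included file with a BOM
ob_end_flush();                   // the handler runs here and drops the BOM
```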

[2017-03-31 13:10 UTC] furun at arcor dot de

OK, I would like to sum up the points, and then I will leave the discussion here.

In principle, it is never a good idea to put a warning sign in front of a stone in the way ("be careful about the stone") instead of just taking the stone out of the way. Every workaround is a "sign in front of the stone" that may mean a long "detour", and should only be used temporarily during development. In most cases it is possible and better to take the stones away, even if you have to use "dynamite". (Removing mysql and replacing it with mysqli is "dynamite" in this sense, because it meant massive development on the PHP side and code maintenance on the other. But it has been done now, to get "stones" out of the way.)

- You get me wrong if you think I want a fix for myself. I have already fixed it for good with a maintenance script, because I don't like to waste my lifetime on such things. I am arguing here for a reduction of wasted developer and user lifetime: for the fix of a really nasty bug source, which could easily be fixed in PHP with no compatibility pain. It may need only one or a few lines of code. I am writing an improvement request or bug report here, not a help request in a forum. (Thanks anyway to the writers for the workaround help.)
- "zend.multibyte + zend.detect_unicode" is a "sign in front of the stone".
- "output_buffering" is a "sign in front of the stone".
- "adjust tooling" is a "sign in front of the stone". People use their favorite tools. It is not real-life advice; otherwise you could just as well write "don't code bugs", and this bug report would become useless. Obviously real life doesn't work like this.
- "BOMs are meaningless for UTF-8": this is not really true, since the BOM tells an editor that it has to use UTF-8. I have had scripts, again from third parties, with mixed encodings (ASCII and UTF-8) and destroyed characters, because the editor was confused. Users resort to ugly and insecure things like "<?php //äöü" to tell the tools to use UTF-8, and tools use things like "//Setup VIM: enc=utf-8 :". File headers are there to tell the software the data type and encoding, and the UTF-8 BOM is such a header.
- "PHP didn't invent the BOM" is irrelevant; it exists and causes trouble. Browser developers have understood this and have ignored BOMs for a long time. (In the distant past I had a typical problem with a JS file, again wasted hours, and found the BOM bug in third-party code far away from the buggy behavior. Same as in PHP, and browsers have now fixed this bug source.)
- "bug reports against third-party libraries" is exactly the waste of time I am arguing to stop, as close as possible to the source, instead of multiplying it outward to countless developers and users, with countless wasted hours, bug reports, and forum workaround help...

To me it looks like nothing more than a dev dogma, without any real-life technical reasons. If there is a technical reason I am missing, it was not written here. Why should PHP not delete [BoF]BOM<?php, AND even ?>\s+[EoF] (because "?>\s+" is the same type of problem at the other end of the script)?
Especially now, when many developers are switching from PHP 5 to PHP 7, this could be done, even if it causes problems in some very rare cases.

Thanks for reading (if you feel pain reading this long text... maybe you see the problem ;-), bye

[2017-03-31 13:24 UTC] spam2 at rhsoft dot net

> If there is a technical reason I am missing, it was not written here

IT WAS written here - anything outside <?php ?> is NOT part of the php script - where will you start here and where will you stop?

the next step when opening that can of worms is "Uhm, if i have linebreaks after ?> in include files, output-buffering, headers and sessions are broken, why does PHP not do an implicit trim()"

[2017-03-31 19:51 UTC] furun at arcor dot de

> IT WAS written here - anything outside <?php ?> is NOT part of the php script

This is the dev dogma in question. It is not a technical reason, but a dogma.
PHP scripts are written by humans in text files, not by machines in binary files.

And yes, why not do an implicit trim()? There is no technical reason not to. In most cases the extra whitespace after the last "?>" is not intended, and a trim() would guard against human imperfection. (My maintenance script deletes the last ?> and all whitespace after it; regex = "?>\s+$".) Explicit newlines can be emitted in code, in the rare cases where they are intended.
PHP already implicitly ignores one newline after ?>.
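
The kind of maintenance script described above could look roughly like the following sketch. It is not the author's actual script: clean_php_source is a hypothetical name, and the pattern uses \s* and \z rather than the quoted \s+$ so that a closing tag with nothing after it is handled as well. It assumes the file ends in PHP mode with a complete statement before the closing tag; a file ending in something like "<?php foo() ?>" relies on the tag as an implicit semicolon and would break:

```php
<?php
// Removes a leading UTF-8 BOM and a final "?>" with any trailing
// whitespace from PHP source text. Everything in between is untouched.
function clean_php_source(string $source): string
{
    // Drop a single UTF-8 BOM at the very beginning of the file.
    if (strncmp($source, "\xEF\xBB\xBF", 3) === 0) {
        $source = (string) substr($source, 3);
    }

    // Drop a closing tag at the very end; \z anchors at the true end
    // of the string, so tags in the middle of the file are kept.
    $cleaned = preg_replace('/\?>\s*\z/', '', $source);

    // preg_replace() returns null on error; fall back to the input then.
    return $cleaned ?? $source;
}
```

A wrapper that reads each file, applies clean_php_source(), and writes the result back only when it differs would complete the script.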


> where will you start here and where will you stop?

The can of worms is not so deep: I start by deleting the BOM at [BoF] before <?php, and stop by trimming after ?> [EoF].
Otherwise the PHP community will still be bug-hunting, working around and forum-helping this decades from now.
Or simply stop at deleting the BOM, in case there are technical reasons not to trim.

[2017-03-31 20:02 UTC] cmb@php.net

-Status: Not a bug
+Status: Re-Opened
-Assigned To: cmb
+Assigned To:

[2017-03-31 20:02 UTC] cmb@php.net

Okay, re-opening and bailing out of this discussion. Thanks!

[2017-04-01 02:40 UTC] spam2 at rhsoft dot net

> And yes, why not do an implicit trim()

because when i want a gambling machine i use a gambling machine and not a programming language

[2017-04-01 13:40 UTC] furun at arcor dot de

This is rhetoric to protect a dogma, not a technical argument. If a syntax definition acts like a pitfall, it should be changed.
Bug prevention is the opposite of randomness. Human errors are random, and bug prevention takes them out of the way.
The right development strategy is to look at how often a trailing newline or whitespace is intended, and how often coders stumble over it unintentionally, and then to take the unwanted randomness out of the game if intended trailing newlines/whitespace are very rare and can be produced more explicitly.

In the middle of a mixed PHP and HTML script, precision in the <?php ?> delimiters makes sense and should not be weakened.

Thankfully there is no requirement to use a closing ?>; it is useless anyway, especially in scripts that are nearly 100% PHP code.
Smart coders don't write a closing ?> at all, to save their time.

I see the whitespace after the last ?> as less of a bug source than the BOM, but I lack the information and experience to decide what is best.
A very active PHP community forum helper can maybe tell how often this becomes a pitfall, and whether it makes sense to change the definitions.
(Just from visiting forums occasionally, I have seen it many times; ?>.. seems to be a pitfall to me.)

[2017-04-01 14:05 UTC] spam2 at rhsoft dot net

> This is rhetoric to protect a dogma, not a technical argument

frankly, if you did not realize it - i am not a php upstream-developer and so don't need to protect any dogma - i am just a userland php-script developer for 15 years now, have written some hundred thousand lines of php-code and frankly tell you if those are real issues for a developer he REALLY should stop developing software at all

[2017-04-01 14:08 UTC] spam2 at rhsoft dot net

and problems have to be fixed where they are

* idiotic editors which add BOM to non-UTF16 files
* idiotic php editors which add newlines at the end of the file

period

[2017-04-01 15:02 UTC] furun at arcor dot de

(spam2) I understand you there, but which is easier: to reeducate the entire planet, or to take the stone out of the way?
People use their favorite tools, inexperienced developers publish their first plugins/code for other software, users use them without any way to check them, and professionals make mistakes or hunt bugs at the wrong end (they just don't need forum help then).
My arguments are there to help the PHP community, which has to deal with this, and passive users with no coding background, who spend time waiting for bug fixes when it goes wrong; and it will go wrong.

You as an experienced developer will not be harmed either way, whether BOM and ?> are cleaned or not.
My effort here is for the "wild real life, all out there".

[2017-04-01 15:39 UTC] nikic@php.net

Let's not start calling people idiots...

Here is a semi-recent internals thread on the topic of BOM handling: http://markmail.org/message/besjw22hxlpwlvdh

My opinion here is essentially this:
a) First, a philosophical point. PHP, in a default configuration, does not care about file encoding. As long as it's ASCII compatible, PHP does not care whether your file uses UTF-8, ISO-8859-1 or Windows-1251. PHP does not perform any validation or conversion, it gives you everything back exactly as it received it. There *is* a special mode in which PHP does all these things, and that's zend.multibyte. If this mode is enabled, PHP can convert character encodings, strip BOMs, etc. Of course, we know that in practice there is very little interest in this.
b) Second, a practical point. You are considering the case where the BOM was accidentally introduced by bad tooling (say, people writing PHP code in MS Word). However, it may also be introduced *intentionally*, to be produced as part of the output. (For the sake of an example, assume you have template files including BOMs, because you want to deliver a response with BOMs, as tooling on the consuming side requires it.) If nothing else, this is a backwards compatibility break and a rather subtle one at that. As such, such a change could probably only be made behind an ini setting, which would probably not help with your particular concern.

[2017-04-01 17:04 UTC] spam2 at rhsoft dot net

> to reeducate the entire planet

yes, because such "education" is a prerequisite anyways - or were you born knowing programming languages, and if so, why was that one piece missing

> or to take the stone out of the way

there is no stone - frankly looking at 90% of PHP code out there (leading us to prohibit 3rd party code on our servers as much as possible for a decade now) there are way too few stones checking if someone should produce code running on servers for which he is most of the time not responsible on his own, and innocent sysadmins and attacked 3rd party sites have to chew the outcome later

[2017-04-01 21:22 UTC] furun at arcor dot de

(nikic)

a) A technical argument could be that just stripping the BOM, without processing any of the UTF-8/encoding information, could cause other problems; PHP would then perhaps have to care somehow about what the BOM says about the encoding. My question would then be: if PHP still does not care about file encoding, would stripping break anything more than leaving BOMs in the files does? I think that if a BOM is there, PHP could presume a correct and intended encoding. For UTF/encoding handling, the code is then definitely responsible and not PHP, even for UTF error correction. (I use a script for this. Long ago, when some software switched from ASCII to UTF-8, I saw some confused files, and I fixed them by script instead of manually correcting all the [?] characters.)
I may be missing something there, but my suggestion would be to just delete/ignore the BOM and keep the rest of the file as it is. (Keep the present behavior in the default configuration.)
(I have never used zend.multibyte; I like that PHP outputs everything exactly as it is. The only exceptions are the BOM and whitespace after ?>, where I see recurring problems.)

b) Another technical argument is backwards compatibility. My suggestion is that if BOM output is intentional, it could be done explicitly in code. Relying on the file is a bug source of its own if a file should have a BOM for the output but does not, because in the editor the BOM is likely only visible in the options and not in the text itself, so it is not visible at first glance (Notepad++, for example). Explicit BOM output would be cleaner. And I would speculate that an intentional BOM output is far less likely than buggy behavior from unintentional BOMs. A developer who wants to output BOMs is also focused on this specific behavior, unlike with unintentional BOMs.
(PHP could define constants for all important BOMs, for explicit output, if that has not been done already.)
The discussion I was looking for here is about the possible backwards compatibility break, and whether such problems would be as probable and recurring as unintended BOMs.
My opinion is that intentional BOMs are rare, and that the PHP 5 > PHP 7 upgrade is the right time for the change in case it breaks existing code, when all developers are busy with maintenance and compatibility anyway.
If the BOM handling, and maybe even the ?> handling, is done behind an ini setting, it's even better: it gives developers the choice. The question is then whether stripping the BOM should be the default setting. With my present opinion I would vote for stripping the BOM by default in PHP 7, and maybe not in PHP 5(?) if backwards compatibility is a big issue.
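
Explicit BOM output in code, as suggested in b), might be sketched as follows. UTF8_BOM and prepend_bom are hypothetical userland names; PHP core does not appear to ship a predefined BOM constant, so the constant is defined here by hand:

```php
<?php
// Hypothetical userland constant for the UTF-8 BOM (bytes EF BB BF).
const UTF8_BOM = "\xEF\xBB\xBF";

// Returns $body with exactly one leading UTF-8 BOM.
function prepend_bom(string $body): string
{
    if (strncmp($body, UTF8_BOM, 3) === 0) {
        return $body; // already starts with a BOM, do not add a second one
    }
    return UTF8_BOM . $body;
}

// The BOM is now a visible, deliberate decision in the source code
// instead of an invisible property of the file.
echo prepend_bom("name;value\n");
```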

The big question is: are the BOM and whitespace after ?> a recurring and potentially nasty issue?
In my opinion they are, and it could be fixed pragmatically, simply by changing some PHP rules. Correct me if I am wrong there.

[2017-04-01 21:32 UTC] spam2 at rhsoft dot net

> My opinion is that intentional BOMs are rare, and that the PHP 5 > PHP 7
> upgrade is the right time for the change in case it breaks
> existing code, when all developers are busy with maintenance
> and compatibility anyway.

that ship sailed long ago

> If the BOM handling, and maybe even the ?> handling, is done behind
> an ini setting, it's even better: it gives developers the choice

no it doesn't - it only makes it unpredictable if, when and where your code breaks since you as developer normally have no control over ini-settings

[2017-04-01 23:43 UTC] furun at arcor dot de

(nikic)

I read the link you posted here, thanks.
http://markmail.org/message/besjw22hxlpwlvdh
Needless to say, I side with the arguments from Sammy Kaye Powers.

(Repeating myself now...:
I searched a little in my files and found more BOMs than I expected. It was a quick search. As expected, they were mainly in plugins and not in the main code. They were in JS and CSS files too, but browsers all handle that now; JS BOMs no longer cause problems in browsers. Some projects have a "don't use a BOM or a closing ?>" guideline. And in a quick search I found several forum posts related to BOM and ?>; the "headers already sent by" error is maybe the most common, but by far not the only one. The BOM is nasty because it creates errors all over the place, "wearing an invisibility cloak so-to-speak". In my case it was a broken captcha image generator, because of a third-party plugin far away. It was a nerve-racking bug hunt because two scripts had a BOM, and I searched for a while in the wrong places. Because I clean this bug source out of my own code, I was not primed to look for the bug there, and I expected a code incompatibility (not a BOM) from the PHP 5 > PHP 7 work. So it was nothing like a "5 min google search". Experienced forum helpers may know better, but I think it is a source of recurring nasty issues... and I would fix it as discussed there, in the lexer.

MonoDevelop once destroyed one of my C# files; it was a file for handling UTF-8 encoding problems, and then it got problems itself. It was suddenly simply broken, full of [?] characters, and I had not worked on it for a long time. Since then, all my C# files have the UTF-8 BOM to tell Mono: "don't destroy it again, it is UTF-8, leave it so". So this goes against the advice "do not use UTF-8 BOMs".

Link from Sammy Kaye Powers, which is how I saw it too.
http://stackoverflow.com/search?q=php+bom
)

[2017-04-02 01:10 UTC] spam2 at rhsoft dot net

> and I expected a code incompatibility (not a BOM) from the PHP 5 > PHP 7 work.
> So it was nothing like a "5 min google search"

what did you not understand in "that ship sailed long ago"?
PHP7.0 is GA since 2015
PHP7.1 is GA since 2016

just because you have not finished "working on PHP5 > PHP7" doesn't mean you can go ahead and make backward incompatible changes at this point in time - to make it clear: our current deployable code base of 250000 lines of code is PHP7 only (return types, scalar type hints, declare(strict_types=1);) and shortly will be PHP7.1 only by introducing void-return-types all over the place

and that's what "that ship sailed long ago" means

[2017-04-03 14:52 UTC] furun at arcor dot de

Referencing http://markmail.org/message/besjw22hxlpwlvdh "don't do magic"...

Argument: see BOM handling as header handling, not as implicit magic:
Standard headers for text files, like those used in binaries, which define the encoding and thus the handling precisely, should "maybe" have been defined a long, long time ago. Now we are left with uncertainty about text/script files: editors can do no better than guess the encoding, users have to correct them manually, and there are many. Text files are thus in general "dirtier" than binary files; for scripts in particular this is not useful, and it requires additional documentation and definitions beside the file/data itself. The BOM could bring a little certainty into text/script files, at least for UTF encodings. UTF-8 has become the favorite encoding, which would be a good argument for PHP to tolerate the BOM.
The BOM can then be seen as a header like those of ZIP, PNG or EXE files, just for a text format, even if this is (unfortunately) uncommon for text files. A header can tell software that the data is actually a JPG and not a PNG, even if the file extension says otherwise. TXT or PHP tells nothing about the encoding. (A good reason why PHP should not change anything in the content: PHP cannot know the intention of the script, as nikic points out in a], so this point is not only philosophical. There is a point where implicit magic becomes voodoo. That is also why I don't use zend.multibyte.) So headers can get exceptional handling, as by definition they are not part of the file's data, but only tell what data is in the file.
Tolerating the header would then no longer be implicit magic, but standard behavior for a header; the "implicit magic" argument is then out of the way for the BOM.
(I would then put BOMs in all my PHP 7 files, like I did with my C# and TXT files...)

(...But this doesn't work for ?> cleaning, which is indeed more controversial. That came up late in the discussion anyway, and I just jumped into it. The question is whether implicit magic does more harm than it prevents. When I checked how common BOM problems are, I found a lot of forum entries about "Headers already sent" errors caused by whitespace after the last ?>. I have a pragmatic point of view there: if an implicit action prevents far more problems than it causes, and it is intuitively right, I would do it. But I understand that the point of view of developers who are responsible for a language used worldwide may be different, and that they may come to a conservative conclusion.)

[2018-09-12 13:14 UTC] jean-marc at paratte dot ch

A simple solution to this annoying problem could be to introduce a new PHP setting to automatically remove the BOM from input files (include, require, ...).

When working in multi-charset environments, it is very important to pin down the charset, for example when the text file is not in the native operating system charset, like Windows-1252.

My TextPad editor handles Windows-1252 and UTF-8 without BOM very badly. And I think a lot of other software is in the same situation.

So, please, add a simple setting to remove the BOM automatically.

[2021-01-30 13:34 UTC] loodvard14 at gmail dot com

I was going to say exactly what "jean-marc at paratte dot ch" said:
adding a simple setting to include / require / etc...
And also adding a general setting to change the default of that.
Like:
Include("file.php",true); // If the command can be function
or
Include "file.php,1";
or
Include "file.php /clean";

And our wish is that it removes any kind of header/BOM from files.

Best regards :)

[2021-07-26 14:55 UTC] cmb@php.net

-Type: Bug
+Type: Feature/Change Request

[2021-07-26 14:55 UTC] cmb@php.net

This is definitely not a bug, so changing to feature request – not
that it matters much, because changing the current behavior would
likely require an RFC[1], and apparently nobody is willing to
pursue the RFC process.

[1] <https://wiki.php.net/rfc/howto>


