Bug #35447 xml_parse_into_struct() chokes on the UTF-8 BOM
Submitted: 2005-11-28 14:55 UTC Modified: 2005-12-19 15:18 UTC
From: saramaca at libertysurf dot fr Assigned: rrichards (profile)
Status: Closed Package: XML related
PHP Version: 5CVS-2005-12-19 (cvs) OS: *
Private report: No CVE-ID: None

 [2005-11-28 14:55 UTC] saramaca at libertysurf dot fr
Description:
------------
In PHP 4, xml_parse_into_struct() can parse a UTF-8-encoded XML file with or without a UTF-8 BOM (\xEF\xBB\xBF). In PHP 5, this is no longer the case: it raises an error saying the string doesn't contain any XML data (Empty document).

Additionally, PHP 5's xml_parse_into_struct() does *NOT* place default attribute values into the struct (e.g. despite the DTD provided, $content[1]['attributes']['type'] isn't set to "literal" in the actual result section below; please compare it to the expected result). This used to work under PHP 4.1.x and above (but there the parser is based on expat, AFAIK).

PS: I guess "manually" stripping this magic number -- if embedded -- before calling the function would yield the expected result. However, I found an acceptable workaround that seems to work equally well across versions 4 and 5 of PHP:

<?php
...
$parser = xml_parser_create('');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $encoding);
...
?>

Rather than:

<?php
...
$parser = xml_parser_create($encoding);
...
?>
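
For comparison, the "manual" alternative mentioned above (stripping the BOM before calling the function) could look roughly like the sketch below. The helper name strip_utf8_bom() and the sample document are assumptions made for illustration; they are not part of PHP or of the reproduce script.

```php
<?php
// Hypothetical helper (not a PHP built-in, not from the original report):
// returns $data without a leading UTF-8 byte order mark, if one is present.
function strip_utf8_bom($data)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($data, $bom, 3) === 0) {
        // The cast guards against substr() returning false on older PHP
        // versions when the input consists of the BOM only.
        return (string) substr($data, 3);
    }
    return $data;
}

// Assumed sample input: a small UTF-8 document preceded by a BOM.
$xml = "\xEF\xBB\xBF"
     . '<?xml version="1.0" encoding="UTF-8"?>'
     . '<bundle><resource key="rSeeYou">' . "A bient\xC3\xB4t" . '</resource></bundle>';

$parser = xml_parser_create('UTF-8');
// Keep tag and attribute names in their original case, as in the report's output.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Parse the BOM-free string; $values and $index are filled by reference.
$ok = xml_parse_into_struct($parser, strip_utf8_bom($xml), $values, $index);
if ($ok !== 1) {
    echo xml_error_string(xml_get_error_code($parser)), "\n";
}
```

This only handles the UTF-8 BOM; documents in other encodings would need their own checks.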

Reproduce code:
---------------
http://www.diptyque.net/bugs/utf8_bom.php
; running PHP 4 --> outputs expected result

http://www.diptyque.net/bugs/utf8_bom.phps
; source code

Expected result:
----------------
w/ autodetect -->
Array
(
    [0] => Array
        (
            [tag] => bundle
            [type] => open
            [level] => 1
            [value] =>

        )

    [1] => Array
        (
            [tag] => resource
            [type] => complete
            [level] => 2
            [attributes] => Array
                (
                    [key] => rSeeYou
                    [type] => literal
                )

            [value] => A bient&244;t
        )

    [2] => Array
        (
            [tag] => bundle
            [value] =>

            [type] => cdata
            [level] => 1
        )

    [3] => Array
        (
            [tag] => bundle
            [type] => close
            [level] => 1
        )

)
w/o autodetect -->
Array
(
    [0] => Array
        (
            [tag] => bundle
            [type] => open
            [level] => 1
            [value] =>

        )

    [1] => Array
        (
            [tag] => resource
            [type] => complete
            [level] => 2
            [attributes] => Array
                (
                    [key] => rSeeYou
                    [type] => literal
                )

            [value] => A bient&244;t
        )

    [2] => Array
        (
            [tag] => bundle
            [value] =>

            [type] => cdata
            [level] => 1
        )

    [3] => Array
        (
            [tag] => bundle
            [type] => close
            [level] => 1
        )

)

Actual result:
--------------
w/ autodetect -->
Array
(
    [0] => Array
        (
            [tag] => bundle
            [type] => open
            [level] => 1
            [value] =>

        )

    [1] => Array
        (
            [tag] => resource
            [type] => complete
            [level] => 2
            [attributes] => Array
                (
                    [key] => rSeeYou
                )

            [value] => A bient&244;t
        )

    [2] => Array
        (
            [tag] => bundle
            [value] =>

            [type] => cdata
            [level] => 1
        )

    [3] => Array
        (
            [tag] => bundle
            [type] => close
            [level] => 1
        )

)
w/o autodetect -->
Empty document

History

 [2005-11-28 18:03 UTC] iliaa@php.net
expat vs libxml2 incompatibility?
 [2005-11-28 20:28 UTC] rrichards@php.net
As for the default attribute values, I have to check on the expat behavior.

The other issue is fixed with libxml2 2.6.18. I have a patch (http://www.ctindustries.net/patches/xml.compat.diff.txt) that looks like it should work around the issue with older libxml2 libs, but it needs more testing with different encoding/BOM schemes to make sure it doesn't break anything, as we're playing with the libxml encoding handling here.
 [2005-12-19 08:57 UTC] sniper@php.net
Rob, what's the status with this?
 [2005-12-19 15:18 UTC] rrichards@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

As for default values, you have to create a namespace parser for those to work.
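
A minimal sketch of what that might look like, assuming "namespace parser" refers to xml_parser_create_ns(). The sample document and its DTD are made up for illustration, and whether the default value actually shows up may depend on the PHP and libxml2 versions in use, so the script prints the result rather than assuming it.

```php
<?php
// Assumed sample document: the internal DTD subset declares a default
// value ("literal") for the "type" attribute of <resource>.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE bundle [
<!ELEMENT bundle (resource*)>
<!ELEMENT resource (#PCDATA)>
<!ATTLIST resource
    key  CDATA #REQUIRED
    type CDATA "literal">
]>
<bundle><resource key="rSeeYou">See you soon</resource></bundle>
XML;

// Create a namespace-aware parser instead of using xml_parser_create().
$parser = xml_parser_create_ns('UTF-8');
// Keep names in their original case, as in the report's output.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

$ok = xml_parse_into_struct($parser, $xml, $values, $index);

// $values[1] is the <resource> element; show whether "type" was filled in.
$attributes = $values[1]['attributes'];
var_dump(isset($attributes['type']) ? $attributes['type'] : null);
```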
 