Bug #35447 xml_parse_into_struct() chokes on the UTF-8 BOM
Submitted: 2005-11-28 14:55 UTC Modified: 2005-12-19 15:18 UTC
From: saramaca at libertysurf dot fr Assigned: rrichards (profile)
Status: Closed Package: XML related
PHP Version: 5CVS-2005-12-19 (cvs) OS: *
Private report: No CVE-ID: None
 [2005-11-28 14:55 UTC] saramaca at libertysurf dot fr
Description:
------------
In PHP 4, xml_parse_into_struct() can parse a UTF-8-encoded XML file with or without a UTF-8 BOM (\xEF\xBB\xBF). In PHP 5 this is no longer the case: it raises an error saying the string doesn't contain any XML data (Empty document).

Additionally, PHP 5's xml_parse_into_struct() does *NOT* place default attribute values into the struct (e.g. despite the DTD provided, $content[1]['attributes']['type'] isn't set to "literal" in the actual result section below; compare it to the expected result). This used to work under PHP 4.1.x and above (but that parser is based on expat, AFAIK).

PS: I guess "manually" stripping this magic number (if present) before calling the function would yield the expected result. However, I found an acceptable workaround that seems to work equally well across PHP 4 and 5:

<?php
...
$parser = xml_parser_create('');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $encoding);
...
?>

Rather than:

<?php
...
$parser = xml_parser_create($encoding);
...
?>
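
For reference, the "manual" stripping mentioned in the PS could look roughly like the sketch below. The helper name strip_utf8_bom() and the sample document are made up for illustration and are not taken from the reproduce script; the code only assumes that the BOM, when present, is the first three bytes of the string.

```php
<?php
// Hypothetical helper (not part of the original report):
// remove a leading UTF-8 BOM (\xEF\xBB\xBF) if the string starts with one.
function strip_utf8_bom(string $xml): string
{
    if (strncmp($xml, "\xEF\xBB\xBF", 3) === 0) {
        return substr($xml, 3);
    }
    return $xml;
}

// Made-up sample input: a BOM followed by a small UTF-8 document.
$xml = "\xEF\xBB\xBF"
     . '<?xml version="1.0" encoding="UTF-8"?>'
     . '<bundle><resource key="rSeeYou">See you</resource></bundle>';

$parser = xml_parser_create('UTF-8');
// Keep tag and attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Parse the BOM-free string; $values and $index are filled by reference.
$ok = xml_parse_into_struct($parser, strip_utf8_bom($xml), $values, $index);

if ($ok !== 1) {
    echo xml_error_string(xml_get_error_code($parser)), "\n";
}
```

Stripping the BOM up front keeps the call to xml_parser_create($encoding) unchanged, at the cost of one extra string check per document.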

Reproduce code:
---------------
http://www.diptyque.net/bugs/utf8_bom.php
; running PHP 4 --> outputs expected result

http://www.diptyque.net/bugs/utf8_bom.phps
; source code

Expected result:
----------------
w/ autodetect -->
Array
(
    [0] => Array
        (
            [tag] => bundle
            [type] => open
            [level] => 1
            [value] =>

        )

    [1] => Array
        (
            [tag] => resource
            [type] => complete
            [level] => 2
            [attributes] => Array
                (
                    [key] => rSeeYou
                    [type] => literal
                )

            [value] => A bientôt
        )

    [2] => Array
        (
            [tag] => bundle
            [value] =>

            [type] => cdata
            [level] => 1
        )

    [3] => Array
        (
            [tag] => bundle
            [type] => close
            [level] => 1
        )

)
w/o autodetect -->
Array
(
    [0] => Array
        (
            [tag] => bundle
            [type] => open
            [level] => 1
            [value] =>

        )

    [1] => Array
        (
            [tag] => resource
            [type] => complete
            [level] => 2
            [attributes] => Array
                (
                    [key] => rSeeYou
                    [type] => literal
                )

            [value] => A bientôt
        )

    [2] => Array
        (
            [tag] => bundle
            [value] =>

            [type] => cdata
            [level] => 1
        )

    [3] => Array
        (
            [tag] => bundle
            [type] => close
            [level] => 1
        )

)

Actual result:
--------------
w/ autodetect -->
Array
(
    [0] => Array
        (
            [tag] => bundle
            [type] => open
            [level] => 1
            [value] =>

        )

    [1] => Array
        (
            [tag] => resource
            [type] => complete
            [level] => 2
            [attributes] => Array
                (
                    [key] => rSeeYou
                )

            [value] => A bientôt
        )

    [2] => Array
        (
            [tag] => bundle
            [value] =>

            [type] => cdata
            [level] => 1
        )

    [3] => Array
        (
            [tag] => bundle
            [type] => close
            [level] => 1
        )

)
w/o autodetect -->
Empty document

 [2005-11-28 18:03 UTC] iliaa@php.net
expat vs libxml2 incompatibility?
 [2005-11-28 20:28 UTC] rrichards@php.net
As for the default attribute values, I have to check the expat behavior.

The other issue is fixed in libxml2 2.6.18. I have a patch (http://www.ctindustries.net/patches/xml.compat.diff.txt) that looks like it should work around the issue with older libxml2 libraries, but it needs more testing with different encoding/BOM schemes to make sure it doesn't break anything, as we're playing with the libxml encoding handling here.
 [2005-12-19 08:57 UTC] sniper@php.net
Rob, what's the status with this?
 [2005-12-19 15:18 UTC] rrichards@php.net
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

As for default attribute values, you have to create a namespace-aware parser for those to work.
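
A minimal sketch of what that might look like, assuming "namespace parser" refers to xml_parser_create_ns(). The inline DTD below is made up for illustration (the DTD from the reproduce script is not shown in this report), and whether the defaulted "type" attribute actually appears in the struct may depend on the PHP and libxml2 versions in use.

```php
<?php
// Made-up document: the ATTLIST declares a default value for "type".
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE bundle [
  <!ELEMENT bundle (resource*)>
  <!ELEMENT resource (#PCDATA)>
  <!ATTLIST resource key CDATA #REQUIRED type CDATA "literal">
]>
<bundle><resource key="rSeeYou">See you</resource></bundle>
XML;

// Namespace-aware parser, as suggested in the comment above.
$parser = xml_parser_create_ns('UTF-8');
// Keep tag and attribute names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

$ok = xml_parse_into_struct($parser, $xml, $values, $index);

if ($ok === 1) {
    // Inspect the attributes of <resource>; look for a "type" entry here.
    print_r($values[1]['attributes']);
} else {
    echo xml_error_string(xml_get_error_code($parser)), "\n";
}
```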
 