Bug #51903 simplexml_load_file() doesn't use HTTP headers
Submitted: 2010-05-25 07:16 UTC
Modified: 2017-10-24 06:15 UTC
Votes: 5
Avg. Score: 4.4 ± 0.8
Reproduced: 4 of 4 (100.0%)
Same Version: 3 (75.0%)
Same OS: 0 (0.0%)
From: phpwnd at gmail dot com
Assigned:
Status: Open
Package: SimpleXML related
PHP Version: 5.3.2
OS:
Private report: No
CVE-ID: None

 [2010-05-25 07:16 UTC] phpwnd at gmail dot com
Description:
------------
Seen at http://stackoverflow.com/questions/2899274/

If you use simplexml_load_file() to load a remote document via HTTP, SimpleXML assumes that the content is UTF-8 regardless of the HTTP headers. In the test script below, at the time of writing, Google's web server returns something like:

-------------
HTTP/1.1 200 OK
Content-Type: text/xml; charset=GB2312
Date: Tue, 25 May 2010 05:05:17 GMT
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, no-store, must-revalidate
expires=Thu, 24-May-2012 05:05:17 GMT; path=/; domain=.google.com
X-Content-Type-Options: nosniff
Server: igfe
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked

<?xml version="1.0"?><xml_api_reply version="1">
<!-- GB2312-encoded content -->
</xml_api_reply>
-------------

The server advertises the content type as "text/xml; charset=GB2312", but since the XML declaration doesn't specify an encoding, SimpleXML assumes UTF-8 and fails to load the document.

If it is at all possible, SimpleXML (and DOM, I assume) should look at the HTTP headers to find the document's encoding.
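
Until something like that exists, a userland workaround is to read the charset from the response headers and transcode the body to UTF-8 before parsing. The following is only a minimal sketch: the helper names charset_from_headers() and load_xml_with_charset() are hypothetical, it relies on the iconv extension knowing the advertised charset, it uses PHP 7.1+ type syntax, and it assumes the XML declaration carries no encoding attribute of its own (as in the response above).

```php
<?php
// Hypothetical helper: extract the charset parameter from raw HTTP header
// lines, such as those PHP places in $http_response_header after a
// file_get_contents() call on an http:// URL.
function charset_from_headers(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?;\s*charset="?([^";\s]+)/i', $line, $m)) {
            $charset = $m[1]; // keep the last match, in case redirects were followed
        }
    }
    return $charset;
}

// Hypothetical helper: transcode the body to UTF-8, then hand it to SimpleXML.
// Assumes the XML declaration does not name a conflicting encoding.
function load_xml_with_charset(string $body, ?string $charset)
{
    if ($charset !== null && strcasecmp($charset, 'UTF-8') !== 0) {
        $converted = iconv($charset, 'UTF-8', $body);
        if ($converted === false) {
            return false; // iconv rejected the charset name or the byte sequence
        }
        $body = $converted;
    }
    return simplexml_load_string($body);
}

// Usage (needs network access; $url is a placeholder):
// $body = file_get_contents($url);
// $xml  = load_xml_with_charset($body, charset_from_headers($http_response_header));
```

The conversion happens outside libxml2, so it depends on what iconv supports and not on libxml2's own encoding tables.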

Test script:
---------------
simplexml_load_file('http://www.google.com/ig/api?weather=11791&hl=zh-CN');

Actual result:
--------------
PHP Warning:  simplexml_load_file(): http://www.google.com/ig/api?weather=11791&hl=zh-CN:1: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC7 0xE7 0x22 0x2F in Command line code on line 1

Warning: simplexml_load_file(): http://www.google.com/ig/api?weather=11791&hl=zh-CN:1: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC7 0xE7 0x22 0x2F in Command line code on line 1
PHP Warning:  simplexml_load_file(): t_system data="SI"/></forecast_information><current_conditions><condition data=" in Command line code on line 1

Warning: simplexml_load_file(): t_system data="SI"/></forecast_information><current_conditions><condition data=" in Command line code on line 1
PHP Warning:  simplexml_load_file():                                                                                ^ in Command line code on line 1

Warning: simplexml_load_file():

Patches

check_stream-wrapperdata_for_encoding (last revision 2010-05-26 13:02 UTC by mike@php.net)

History

 [2010-05-26 15:02 UTC] mike@php.net
The following patch has been added/updated:

Patch Name: check_stream-wrapperdata_for_encoding
Revision:   1274878924
URL:        http://bugs.php.net/patch-display.php?bug=51903&patch=check_stream-wrapperdata_for_encoding&revision=1274878924
 [2010-05-26 15:03 UTC] mike@php.net
Generally, something like the attached patch could work, but not for this particular problem, as libxml2 does not know GB2312; see libxml2/libxml/encoding.h.
 [2010-05-26 15:22 UTC] felipe@php.net
-Status: Open
+Status: Assigned
-Assigned To:
+Assigned To: rrichards
 [2017-10-24 06:15 UTC] kalle@php.net
-Status: Assigned
+Status: Open
-Assigned To: rrichards
+Assigned To:
 
PHP Copyright © 2001-2019 The PHP Group
All rights reserved.
Last updated: Wed Sep 18 09:01:27 2019 UTC