Bug #51903 simplexml_load_file() doesn't use HTTP headers
Submitted: 2010-05-25 07:16 UTC
Modified: 2021-03-02 16:25 UTC
Votes: 5
Avg. Score: 4.4 ± 0.8
Reproduced: 4 of 4 (100.0%)
Same Version: 3 (75.0%)
Same OS: 0 (0.0%)
From: phpwnd at gmail dot com
Assigned: cmb
Status: Closed
Package: SimpleXML related
PHP Version: 5.3.2
OS:
Private report: No
CVE-ID: None
 [2010-05-25 07:16 UTC] phpwnd at gmail dot com
Description:
------------
Seen at http://stackoverflow.com/questions/2899274/

If you use simplexml_load_file() to load a remote document via HTTP, SimpleXML assumes that the content is UTF-8 regardless of the HTTP headers. For the URL in the test script below, at the time of writing, Google's web server returns something like:

-------------
HTTP/1.1 200 OK
Content-Type: text/xml; charset=GB2312
Date: Tue, 25 May 2010 05:05:17 GMT
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Cache-Control: no-cache, no-store, must-revalidate
expires=Thu, 24-May-2012 05:05:17 GMT; path=/; domain=.google.com
X-Content-Type-Options: nosniff
Server: igfe
X-XSS-Protection: 1; mode=block
Transfer-Encoding: chunked

<?xml version="1.0"?><xml_api_reply version="1">
<!-- GB2312-encoded content -->
</xml_api_reply>
-------------

The server advertises the content type as "text/xml; charset=GB2312", but since the XML declaration doesn't specify an encoding, SimpleXML assumes the document is UTF-8 and fails to load it.

If it is at all possible, SimpleXML (and DOM, I assume) should look at the HTTP headers to find the document's encoding.
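
Until the headers are honoured, a userland workaround along these lines might help. This is only a sketch under stated assumptions: charsetFromHeaders(), xmlToUtf8() and loadRemoteXml() are hypothetical helper names, not part of SimpleXML, and the mbstring extension is assumed to be loaded and to recognise the advertised charset. The idea is to read the charset from the Content-Type header, transcode the body to UTF-8 when the XML declaration carries no encoding of its own, and then hand the result to simplexml_load_string().

```php
<?php
// Hypothetical helper: pull the charset parameter out of raw HTTP header lines.
function charsetFromHeaders(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $header) {
        // Later matches win, so the final response after redirects takes precedence.
        if (preg_match('/^Content-Type:.*;\s*charset="?([\w.:-]+)"?/i', $header, $m)) {
            $charset = $m[1];
        }
    }
    return $charset;
}

// Hypothetical helper: transcode to UTF-8 unless the document declares its own encoding.
function xmlToUtf8(string $xml, ?string $charset): string
{
    $declaresEncoding = preg_match('/^\s*<\?xml[^>]*\bencoding\s*=/i', $xml) === 1;
    if ($charset === null || $declaresEncoding || strcasecmp($charset, 'UTF-8') === 0) {
        return $xml; // nothing to do, let libxml2 decide
    }
    // Assumes mbstring knows $charset; an unknown name raises an error.
    return mb_convert_encoding($xml, 'UTF-8', $charset);
}

// Hypothetical wrapper around the two helpers; performs the HTTP request itself.
function loadRemoteXml(string $url)
{
    $body = file_get_contents($url);
    if ($body === false) {
        return false;
    }
    // The http:// wrapper fills $http_response_header in the calling scope.
    $headers = $http_response_header ?? [];
    return simplexml_load_string(xmlToUtf8($body, charsetFromHeaders($headers)));
}
```

With this in place, loadRemoteXml('http://www.google.com/ig/api?weather=11791&hl=zh-CN') would stand in for the simplexml_load_file() call in the test script. Transcoding is skipped when the XML declaration names an encoding, because the converted bytes would then contradict the declaration.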

Test script:
---------------
simplexml_load_file('http://www.google.com/ig/api?weather=11791&hl=zh-CN');

Actual result:
--------------
PHP Warning:  simplexml_load_file(): http://www.google.com/ig/api?weather=11791&hl=zh-CN:1: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC7 0xE7 0x22 0x2F in Command line code on line 1

Warning: simplexml_load_file(): http://www.google.com/ig/api?weather=11791&hl=zh-CN:1: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC7 0xE7 0x22 0x2F in Command line code on line 1
PHP Warning:  simplexml_load_file(): t_system data="SI"/></forecast_information><current_conditions><condition data=" in Command line code on line 1

Warning: simplexml_load_file(): t_system data="SI"/></forecast_information><current_conditions><condition data=" in Command line code on line 1
PHP Warning:  simplexml_load_file():                                                                                ^ in Command line code on line 1

Warning: simplexml_load_file():

Patches

check_stream-wrapperdata_for_encoding (last revision 2010-05-26 13:02 UTC by mike@php.net)

History

 [2010-05-26 15:02 UTC] mike@php.net
The following patch has been added/updated:

Patch Name: check_stream-wrapperdata_for_encoding
Revision:   1274878924
URL:        http://bugs.php.net/patch-display.php?bug=51903&patch=check_stream-wrapperdata_for_encoding&revision=1274878924
 [2010-05-26 15:03 UTC] mike@php.net
Generally, something like the attached patch could work, but it would not fix this particular problem, as libxml2 does not know GB2312 -- see libxml2/libxml/encoding.h
 [2010-05-26 15:22 UTC] felipe@php.net
-Status: Open
+Status: Assigned
-Assigned To:
+Assigned To: rrichards
 [2017-10-24 06:15 UTC] kalle@php.net
-Status: Assigned
+Status: Open
-Assigned To: rrichards
+Assigned To:
 [2021-03-02 16:25 UTC] cmb@php.net
-Status: Open
+Status: Verified
-Assigned To:
+Assigned To: cmb
 [2021-03-02 16:25 UTC] cmb@php.net
Still unresolved.
 [2021-03-02 18:42 UTC] cmb@php.net
The following pull request has been associated:

Patch Name: Fix #51903: simplexml_load_file() doesn't use HTTP headers
On GitHub:  https://github.com/php/php-src/pull/6747
Patch:      https://github.com/php/php-src/pull/6747.patch
 [2021-03-08 14:17 UTC] cmb@php.net
Automatic comment on behalf of cmbecker69@gmx.de
Revision: http://git.php.net/?p=php-src.git;a=commit;h=f901bec494ae921f36e1066e4380b92888757f0f
Log: Fix #51903: simplexml_load_file() doesn't use HTTP headers
 [2021-03-08 14:17 UTC] cmb@php.net
-Status: Verified
+Status: Closed
 