PHP :: Bug #50139 :: text in UTF-8 encoded xml cut off by xml parser with German umlauts

Bug #50139	text in UTF-8 encoded xml cut off by xml parser with German umlauts
Submitted:	2009-11-10 17:59 UTC	Modified:	2009-11-11 22:48 UTC
From:	gros at mpdl dot mpg dot de	Assigned:
Status:	Not a bug	Package:	XML Reader
PHP Version:	5.3.0 -> 5.2.5	OS:	Mac OS-X 10.6.2
Private report:	No	CVE-ID:	None


[2009-11-10 17:59 UTC] gros at mpdl dot mpg dot de

Description:
------------
When parsing an xml file with UTF-8 encoding (like this one: http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"


Reproduce code:
---------------
$snippet = file_get_contents("http://bit.ly/3PSi44");

if (!($xml_parser = xml_parser_create("")))
    die("Couldn't create parser.");

// The three handler functions named below are not shown in the report.
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet)) {
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);

Expected result:
----------------
I expect the text to be imported intact, as outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

should result in "Societäts-Verlag"

Actual result:
--------------
I get cut-off pieces of text when the text contains German umlauts (see two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

results in:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"


[2009-11-10 18:02 UTC] gros at mpdl dot mpg dot de

Just to add:
I also used curl for fetching this piece of xml and the result was the same.

[2009-11-11 12:41 UTC] jani@php.net

It might work better if your XML file declared its encoding, or if you passed the input encoding to xml_parser_create().

[2009-11-11 12:42 UTC] jani@php.net

Duh, I missed the very first line in your XML file. :)
So what you're actually reporting is that the input encoding isn't detected properly?

[2009-11-11 12:46 UTC] gros at mpdl dot mpg dot de

Thanks, but the file does declare its encoding, both in the header (application/xml) and in the file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>


And also using 

$xml_parser = xml_parser_create("UTF-8");

does not help!

[2009-11-11 12:47 UTC] jani@php.net

And please provide the complete script you used. It works fine for me with a very crude script.
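
The handler functions never appear in this thread. As a rough sketch of what a complete script could look like, the following uses a hypothetical `TextCollector` class and an inline sample document in place of the remote file (both are assumptions, not the reporter's code). It appends every chunk passed to the character data handler rather than overwriting the previous one. The parser may deliver a single text node in several calls, so a handler that keeps only the latest chunk can produce truncated strings like those reported. Whether that was the cause here is not established in the thread.

```php
<?php
// Hypothetical helper, not from the report: collects element text by name.
class TextCollector
{
    public $texts = array();   // finished text, keyed by element name
    private $buffer = '';      // text of the element currently being read

    public function startElement($parser, $name, $attrs)
    {
        $this->buffer = '';
    }

    public function characterData($parser, $data)
    {
        // Append: this handler can fire more than once per text node.
        $this->buffer .= $data;
    }

    public function endElement($parser, $name)
    {
        $this->texts[$name] = $this->buffer;
        $this->buffer = '';
    }
}

// Inline stand-in for the remote XML file (assumption). "\u{00E4}" is the
// UTF-8 encoded letter "ä"; this escape syntax needs PHP 7 or later.
$snippet = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    . "<dc:publisher>Societ\u{00E4}ts-Verlag</dc:publisher>"
    . '</record>';

$collector = new TextCollector();

$xml_parser = xml_parser_create('UTF-8');
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0); // keep names as written
xml_set_element_handler(
    $xml_parser,
    array($collector, 'startElement'),
    array($collector, 'endElement')
);
xml_set_character_data_handler($xml_parser, array($collector, 'characterData'));

if (!xml_parse($xml_parser, $snippet, true)) {
    die(sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser)
    ));
}
unset($xml_parser); // release the parser

echo $collector->texts['dc:publisher'], "\n";
```

With case folding switched off, the element name arrives as written in the document, so the collected text is stored under the key `dc:publisher`.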

[2009-11-11 17:04 UTC] gros at mpdl dot mpg dot de

Apologies, apparently there are two PHP installations on my system. The one that the XAMPP installation uses is actually 5.2.5, not 5.3.0.


I am using DOMDocument now for parsing and it works like a charm.
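
The reporter's DOMDocument code is not included in the thread. A minimal sketch of that approach might look like the following, where the sample document and the namespace URI `http://example.org/e` are placeholders invented for illustration.

```php
<?php
// Placeholder namespace URI (assumption); the real document's URI is not
// given in the report.
$ns = 'http://example.org/e';

// "\u{00FC}" is the UTF-8 encoded letter "ü" (escape syntax needs PHP 7+).
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<record xmlns:e="' . $ns . '">'
    . "<e:organization-name>Kaiser Wilhelm Institut f\u{00FC}r "
    . "Z\u{00FC}chtungsforschung</e:organization-name>"
    . '</record>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("Could not parse XML.");
}

// Look the element up by namespace URI and local name.
$nodes = $doc->getElementsByTagNameNS($ns, 'organization-name');
$name = $nodes->item(0)->textContent;

echo $name, "\n";
```

DOMDocument hands back the full text of a node as one UTF-8 string through `textContent`, so no chunks need to be reassembled by hand.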

[2009-11-11 22:48 UTC] jani@php.net

Reopen if you can reproduce with something more recent.
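
Before re-testing, it may help to confirm which PHP build actually executes the script, since the reporter turned out to have two installations. A small sketch, assuming the libxml extension is loaded (it provides the `LIBXML_DOTTED_VERSION` constant):

```php
<?php
// Report the interpreter version, the SAPI it runs under, and the libxml
// version it was built against.
$report = sprintf(
    "PHP %s (%s), libxml %s",
    PHP_VERSION,
    PHP_SAPI,
    LIBXML_DOTTED_VERSION
);

echo $report, "\n";
```

Running this both from the command line and through the web server shows whether the two environments use the same installation.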


