PHP :: Bug #50139 :: text in UTF-8 encoded xml cut off by xml parser with German umlauts

Bug #50139	text in UTF-8 encoded xml cut off by xml parser with German umlauts
Submitted:	2009-11-10 17:59 UTC	Modified:	2009-11-11 22:48 UTC
From:	gros at mpdl dot mpg dot de	Assigned:
Status:	Not a bug	Package:	XML Reader
PHP Version:	5.3.0 -> 5.2.5	OS:	Mac OS-X 10.6.2
Private report:	No	CVE-ID:	None

[2009-11-10 17:59 UTC] gros at mpdl dot mpg dot de

Description:
------------
When parsing an xml file with UTF-8 encoding (like this one: http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "?ts-Verlag"


Reproduce code:
---------------
$snippet = file_get_contents("http://bit.ly/3PSi44");

// Note: startElementHandler, endElementHandler and characterDataHandler
// are not included in this report.
if (!($xml_parser = xml_parser_create("")))
    die("Couldn't create parser.");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet)) {
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);

Expected result:
----------------
I expect properly imported text, as outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

should result in "Societäts-Verlag"

Actual result:
--------------
I get cut-off pieces of text when the text contains German umlauts (see two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

results in:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "?ts-Verlag"

[2009-11-10 18:02 UTC] gros at mpdl dot mpg dot de

Just to add:
I also used curl for fetching this piece of xml and the result was the same.

[2009-11-11 12:41 UTC] jani@php.net

It might work better if your XML file declared its encoding OR if you passed the input encoding to xml_parser_create().

[2009-11-11 12:42 UTC] jani@php.net

Duh, I missed the very first line in your XML file. :)
So what you're actually reporting is that the input encoding isn't detected properly?

[2009-11-11 12:46 UTC] gros at mpdl dot mpg dot de

Thanks, but the file does state its encoding, actually, both in the header (application/xml) and in the file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>


And also using 

$xml_parser = xml_parser_create("UTF-8");

does not help!
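
One way to rule out the input itself before looking at the parser is to check that the fetched bytes really are valid UTF-8, independent of what the XML declaration says. The following is only a sketch: the helper name isValidUtf8 and the sample strings are made up for illustration and do not come from the report.

```php
<?php
// Hypothetical helper: returns true when $bytes is a valid UTF-8 byte sequence.
function isValidUtf8($bytes)
{
    // With the /u modifier PCRE validates the subject as UTF-8;
    // preg_match() returns false (not 0) when the subject is invalid.
    return preg_match('//u', $bytes) === 1;
}

// Sample data (assumed, not taken from the remote file).
$utf8   = "f\xC3\xBCr"; // "für" encoded as UTF-8
$latin1 = "f\xFCr";     // "für" encoded as ISO-8859-1, invalid as UTF-8

var_dump(isValidUtf8($utf8));
var_dump(isValidUtf8($latin1));
```

In the scenario above, the same check could be applied to $snippet right after file_get_contents().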

[2009-11-11 12:47 UTC] jani@php.net

And please provide the complete script you used. It works fine for me with a very crude script.
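
For reference, a complete script of this kind could look like the sketch below. It is not the script used by either party in this report. The handler bodies, the inline sample document, and the un-prefixed element names are assumptions made for illustration. The sketch appends inside the character data handler, because that handler can be called more than once for a single text node; a handler that overwrites instead of appending would keep only one fragment. Whether that is what happened here is not confirmed by the report.

```php
<?php
// Inline stand-in for the remote document (assumed sample data).
// The umlaut is written as raw UTF-8 bytes so the source file encoding does not matter.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<record><publisher>Societ' . "\xC3\xA4" . 'ts-Verlag</publisher></record>';

$buffer = '';  // text collected for the element that is currently open
$texts  = [];  // element name => complete text content

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep element names as written

xml_set_element_handler(
    $parser,
    // Start tag: begin a fresh buffer.
    function ($p, $name, $attrs) use (&$buffer) {
        $buffer = '';
    },
    // End tag: store whatever text was collected, then reset.
    function ($p, $name) use (&$buffer, &$texts) {
        if (trim($buffer) !== '') {
            $texts[$name] = $buffer;
        }
        $buffer = '';
    }
);

xml_set_character_data_handler(
    $parser,
    // Append, never assign: the text may arrive in several pieces.
    function ($p, $data) use (&$buffer) {
        $buffer .= $data;
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(sprintf(
        'XML error: %s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    ));
}

echo $texts['publisher'], "\n";
```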

[2009-11-11 17:04 UTC] gros at mpdl dot mpg dot de

Apologies, apparently there are two PHP installations on my system. The one that the XAMPP installation uses is actually 5.2.5, not 5.3.0.


I am using DOMDocument now for parsing and it works like a charm.
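
The DOMDocument code that replaced the parser is not shown in the report. A minimal sketch of that approach, assuming an inline sample document and assuming the dc prefix is bound to the Dublin Core namespace URI, might look like this:

```php
<?php
// Inline stand-in for the remote document (assumed sample data).
// The namespace URI below is an assumption; the real file's bindings are not shown in the report.
$dcNamespace = 'http://purl.org/dc/elements/1.1/';
$xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
     . '<record xmlns:dc="' . $dcNamespace . '">'
     . '<dc:publisher>Societ' . "\xC3\xA4" . 'ts-Verlag</dc:publisher>'
     . '</record>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    throw new RuntimeException('Could not parse XML.');
}

// Look the element up by namespace URI and local name rather than by prefix.
$nodes = $doc->getElementsByTagNameNS($dcNamespace, 'publisher');

// textContent returns the whole text of the node as one UTF-8 string.
$publisher = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

echo $publisher, "\n";
```

Because the DOM exposes each element's text as a single string, there is no handler that has to reassemble fragments.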

[2009-11-11 22:48 UTC] jani@php.net

Reopen if you can reproduce with something more recent.

Copyright © 2001-2026 The PHP Group All rights reserved.	Last updated: Sat Jul 04 06:00:01 2026 UTC