Bug #50139 text in UTF-8 encoded xml cut off by xml parser with German umlauts
Submitted: 2009-11-10 17:59 UTC
Modified: 2009-11-11 22:48 UTC
From: gros at mpdl dot mpg dot de
Assigned:
Status: Not a bug
Package: XML Reader
PHP Version: 5.3.0 -> 5.2.5
OS: Mac OS-X 10.6.2
Private report: No
CVE-ID: None
 [2009-11-10 17:59 UTC] gros at mpdl dot mpg dot de
When parsing an XML file with UTF-8 encoding (like this one: ), text containing German umlauts is cut off:

<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this

results in "äts-Verlag"

Reproduce code:
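$snippet = file_get_contents("");

if (!($xml_parser = xml_parser_create("")))
	die("Couldn't create parser.");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet))
	// The original report is cut off after the format string; the two
	// arguments below are the usual way to fill it in.
	$retstr = sprintf("XML error: %s at line %d",
		xml_error_string(xml_get_error_code($xml_parser)),
		xml_get_current_line_number($xml_parser));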

Expected result:
I expect properly imported text like outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this

should result in "Societäts-Verlag"

Actual result:
I get cut-off pieces of text when the text contains German umlauts (see two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

results in:
"Kaiser Wilhelm Institut f"

or parsing this

results in "äts-Verlag"


 [2009-11-10 18:02 UTC] gros at mpdl dot mpg dot de
Just to add:
I also used curl for fetching this piece of xml and the result was the same.
 [2009-11-11 12:41 UTC]
It might work better if your XML file declared its encoding, or if you passed the input encoding to xml_parser_create().
 [2009-11-11 12:42 UTC]
Duh, I missed the very first line in your XML file. :)
So what you're actually reporting is that the input encoding isn't detected properly?
 [2009-11-11 12:46 UTC] gros at mpdl dot mpg dot de
Thanks, but the file does declare its encoding, actually. Both in the HTTP header (application/xml) and in the file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

And also using 

$xml_parser = xml_parser_create("UTF-8");

does not help!
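
With the declaration and the parser argument both saying UTF-8, one further thing to rule out is whether the fetched bytes really are UTF-8. A minimal sketch of such a check, assuming a made-up sample string in place of the fetched document (the `<name>` element and the variable names are illustrative only):

```php
<?php
// Hypothetical sample; in the report the data comes from file_get_contents().
// "\xC3\xA4" is the UTF-8 byte sequence for "ä".
$snippet = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    . "<name>Societ\xC3\xA4ts-Verlag</name>";

// preg_match() with the /u modifier returns 1 only if the subject is valid UTF-8.
$isUtf8 = preg_match('//u', $snippet) === 1;

// Read the encoding named in the XML declaration, if there is one.
$declared = preg_match('/^<\?xml[^>]*encoding=["\']([^"\']+)["\']/', $snippet, $m)
    ? $m[1]
    : null;

var_dump($isUtf8, $declared);
```

If the declaration says UTF-8 but the check fails, the data was probably re-encoded somewhere between the server and the parser.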
 [2009-11-11 12:47 UTC]
And please provide the complete script you used. It works fine for me with a very crude script.
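
For reference, a complete crude script could look like the sketch below. The handler body is an assumption, since the report never shows `characterDataHandler`; the sample document and its namespace URI are also made up. The handler appends each chunk it receives, because the parser may call the character data handler more than once for a single text node, and a handler that overwrites its result instead of appending would end up holding only part of the text.

```php
<?php
// Hypothetical stand-in for the fetched document; the report's URL is not shown.
// "\xC3\xBC" is the UTF-8 byte sequence for "ü".
$snippet = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    . '<e:organization-name xmlns:e="urn:example:e">'
    . "Kaiser Wilhelm Institut f\xC3\xBCr Z\xC3\xBCchtungsforschung"
    . '</e:organization-name>';

$text = '';

// Assumed handler: append every chunk instead of overwriting the result.
$characterDataHandler = function ($parser, $data) use (&$text) {
    $text .= $data;
};

$xml_parser = xml_parser_create('UTF-8');
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_character_data_handler($xml_parser, $characterDataHandler);

// Third argument true: this is the final (and only) piece of input.
if (!xml_parse($xml_parser, $snippet, true)) {
    die(sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser)));
}

echo $text, "\n";
```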
 [2009-11-11 17:04 UTC] gros at mpdl dot mpg dot de
Apologies, apparently there are two PHP installations on my system. The one that the XAMPP installation uses is actually 5.2.5, not 5.3.0.

I am using DOMDocument now for parsing and it works like a charm.
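
The DOMDocument code is not shown in this report. A minimal sketch of that approach, assuming a made-up namespace URI for the `e:` prefix and a sample string in place of the fetched document:

```php
<?php
// Hypothetical sample document; the namespace URI is invented for illustration.
$ns = 'urn:example:e';
$snippet = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    . '<e:organization-name xmlns:e="' . $ns . '">'
    . "Kaiser Wilhelm Institut f\xC3\xBCr Z\xC3\xBCchtungsforschung"
    . '</e:organization-name>';

$doc = new DOMDocument();
if (!$doc->loadXML($snippet)) {
    die("Couldn't parse document.");
}

$names = array();
foreach ($doc->getElementsByTagNameNS($ns, 'organization-name') as $node) {
    // textContent holds the element's whole text in one UTF-8 string.
    $names[] = $node->textContent;
}

print_r($names);
```

Here the text of each element arrives as a single string, so there are no partial chunks to reassemble.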
 [2009-11-11 22:48 UTC]
Reopen if you can reproduce with something more recent.