Bug #50139 text in UTF-8 encoded xml cut off by xml parser with German umlauts
Submitted: 2009-11-10 17:59 UTC
Modified: 2009-11-11 22:48 UTC
From: gros at mpdl dot mpg dot de
Assigned:
Status: Not a bug
Package: XML Reader
PHP Version: 5.3.0 -> 5.2.5
OS: Mac OS-X 10.6.2
Private report: No
CVE-ID: None
 [2009-11-10 17:59 UTC] gros at mpdl dot mpg dot de
Description:
------------
When parsing an xml file with UTF-8 encoding (like this one: http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"


Reproduce code:
---------------
$snippet = file_get_contents("http://bit.ly/3PSi44");

if (!($xml_parser = xml_parser_create(""))) {
    die("Couldn't create parser.");
}
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet)) {
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);
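
The three handlers named above are not included in the report. As a rough sketch only, they might look like the following. The handler bodies, the variable names and the inline sample document are all assumptions, not the reporter's code. The one point worth illustrating is that the character data handler can be called several times for a single piece of text (the PHP manual mentions non-ASCII strings as a typical case), so the sketch appends each piece instead of overwriting. Closures are used here, which need PHP 5.3 or later; on 5.2 named functions would be required.

```php
<?php
// Hypothetical handlers; the report does not include the originals.
$buffer  = "";      // text collected for the element currently being read
$results = array(); // finished text of each element, in document order

$startElementHandler = function ($parser, $name, $attrs) use (&$buffer) {
    $buffer = ""; // start fresh for each element (flat documents only)
};
$endElementHandler = function ($parser, $name) use (&$buffer, &$results) {
    $results[] = $buffer;
};
$characterDataHandler = function ($parser, $data) use (&$buffer) {
    $buffer .= $data; // append: one text node may arrive in several pieces
};

// Inline stand-in for the remote file; "\xC3\xA4" is the UTF-8 encoding of "ä".
$snippet = '<?xml version="1.0" encoding="UTF-8"?>'
         . "<publisher>Societ\xC3\xA4ts-Verlag</publisher>";

$xml_parser = xml_parser_create("UTF-8");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($xml_parser, $startElementHandler, $endElementHandler);
xml_set_character_data_handler($xml_parser, $characterDataHandler);

if (!xml_parse($xml_parser, $snippet, true)) {
    die(sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser)));
}

echo $results[0], "\n";
```

Whether an overwriting handler was involved in this report is not known; the sketch only shows a handler that is safe against split text.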




Expected result:
----------------
I expect properly imported text like outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

should result in "Societäts-Verlag"

Actual result:
--------------
I get cut-off pieces of text when the text contains German umlauts (see two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

results in:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"

 [2009-11-10 18:02 UTC] gros at mpdl dot mpg dot de
Just to add:
I also used curl for fetching this piece of xml and the result was the same.
 [2009-11-11 12:41 UTC] jani@php.net
It might work better if your XML file declared its encoding, or if you passed the input encoding to xml_parser_create().
 [2009-11-11 12:42 UTC] jani@php.net
Duh, I missed the very first line in your XML file. :)
So what you're actually reporting is that the input encoding isn't detected properly?
 [2009-11-11 12:46 UTC] gros at mpdl dot mpg dot de
Thanks, but the file does declare its encoding, both in the HTTP header (application/xml) and in the file itself:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>


And also using 

$xml_parser = xml_parser_create("UTF-8");

does not help!
 [2009-11-11 12:47 UTC] jani@php.net
And please provide the complete script you used. It works fine for me with a very crude script.
 [2009-11-11 17:04 UTC] gros at mpdl dot mpg dot de
Apologies, apparently there are two PHP installations on my system. The one that the XAMPP installation uses is actually 5.2.5, not 5.3.0.


I am using DOMDocument now for parsing and it works like a charm.
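
The DOMDocument code itself is not shown in the report. A minimal sketch of that approach, assuming an inline sample in place of the remote file, could look like this. The element name comes from the description above; the namespace URI (the usual Dublin Core one) and the variable names are assumptions.

```php
<?php
// Inline stand-in for the remote file; "\xC3\xA4" is the UTF-8 encoding of "ä".
// The namespace URI is an assumption; the report does not show it.
$dcNs = 'http://purl.org/dc/elements/1.1/';
$xml  = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
      . '<dc:publisher xmlns:dc="' . $dcNs . '">'
      . "Societ\xC3\xA4ts-Verlag"
      . '</dc:publisher>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("Couldn't parse XML.");
}

// DOM hands back the whole text of the element as one UTF-8 string.
$nodes     = $doc->getElementsByTagNameNS($dcNs, 'publisher');
$publisher = $nodes->item(0)->textContent;

echo $publisher, "\n";
```

With DOM the text of an element is read in one piece, so there is no handler that has to reassemble it.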
 [2009-11-11 22:48 UTC] jani@php.net
Reopen if you can reproduce with something more recent.
 
Last updated: Tue Dec 01 06:01:23 2020 UTC