Bug #50139 text in UTF-8 encoded xml cut off by xml parser with German umlauts
Submitted: 2009-11-10 17:59 UTC
Modified: 2009-11-11 22:48 UTC
From: gros at mpdl dot mpg dot de
Assigned:
Status: Not a bug
Package: XML Reader
PHP Version: 5.3.0 -> 5.2.5
OS: Mac OS-X 10.6.2
Private report: No
CVE-ID: None
 [2009-11-10 17:59 UTC] gros at mpdl dot mpg dot de
Description:
------------
When parsing an XML file with UTF-8 encoding (like this one: http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"


Reproduce code:
---------------
$snippet = file_get_contents("http://bit.ly/3PSi44");

if (!($xml_parser = xml_parser_create(""))) {
    die("Couldn't create parser.");
}
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

$retstr = "";
if (!xml_parse($xml_parser, $snippet)) {
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);
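
The reproduce code registers three handlers but does not show their bodies, so the cause of the truncation cannot be read off this report. One possibility, which is an assumption and not something confirmed in this thread, is that the character data handler was called more than once for a single text node and each call overwrote the previous chunk. SAX-style parsers are allowed to deliver contiguous text in several pieces. The sketch below uses hypothetical handler bodies and a made-up inline sample in place of the remote file, and appends every chunk to a buffer:

```php
<?php
// Hypothetical handler bodies: the report names these functions but never shows them.
$buffer = '';        // text collected for the element that is currently open
$texts  = array();   // completed text nodes, keyed by element name

function startElementHandler($parser, $name, $attrs)
{
    global $buffer;
    $buffer = '';    // start with an empty buffer for each new element
}

function characterDataHandler($parser, $data)
{
    global $buffer;
    $buffer .= $data;  // append: one text node may arrive in several calls
}

function endElementHandler($parser, $name)
{
    global $buffer, $texts;
    if (trim($buffer) !== '') {
        $texts[$name] = $buffer;
    }
    $buffer = '';
}

// Made-up sample document; "\xC3\xA4" is the UTF-8 byte sequence for "ä".
$snippet = '<?xml version="1.0" encoding="UTF-8"?>'
    . "<record><publisher>Societ\xC3\xA4ts-Verlag</publisher></record>";

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);  // keep element names as written
xml_set_element_handler($parser, 'startElementHandler', 'endElementHandler');
xml_set_character_data_handler($parser, 'characterDataHandler');

if (!xml_parse($parser, $snippet, true)) {
    die(sprintf("XML error: %s at line %d\n",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)));
}

echo $texts['publisher'], "\n";
```

If a handler assigns `$buffer = $data` instead of appending, only the last chunk of a text node survives. That would match the "äts-Verlag" result, but not the "Kaiser Wilhelm Institut f" one.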




Expected result:
----------------
I expect the text to be imported intact, as outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

should result in "Societäts-Verlag"

Actual result:
--------------
I get cut-off pieces of text when the text contains German umlauts (see the two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>

results in:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>

results in "äts-Verlag"

 [2009-11-10 18:02 UTC] gros at mpdl dot mpg dot de
Just to add:
I also used curl for fetching this piece of xml and the result was the same.
 [2009-11-11 12:41 UTC] jani@php.net
It might work better if your XML file declared its encoding OR if you passed the input encoding to xml_parser_create().
 [2009-11-11 12:42 UTC] jani@php.net
Duh, I missed the very first line in your XML file. :)
So what you're actually reporting is that the input encoding isn't detected properly?
 [2009-11-11 12:46 UTC] gros at mpdl dot mpg dot de
Thanks, but the file does declare its encoding, actually. Both in the header (application/xml) and in the file:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>


And also using 

$xml_parser = xml_parser_create("UTF-8");

does not help!
 [2009-11-11 12:47 UTC] jani@php.net
And please provide the complete script you used. It works fine for me with a very crude script.
 [2009-11-11 17:04 UTC] gros at mpdl dot mpg dot de
Apologies, apparently there are two PHP installations on my system. The one that the XAMPP installation uses is actually 5.2.5, not 5.3.0.
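
When more than one PHP installation is present, a quick way to see which interpreter actually serves a script is to print its version and configuration from within that script. This is a minimal sketch and not part of the original report. It uses only built-in constants and functions, and it should be run through the same web server or CLI that runs the parsing code:

```php
<?php
// Report details of the interpreter that executes this very script.
echo 'PHP version: ', PHP_VERSION, "\n";
echo 'SAPI:        ', PHP_SAPI, "\n";

// php_ini_loaded_file() returns false when no php.ini was loaded.
echo 'php.ini:     ', var_export(php_ini_loaded_file(), true), "\n";

// The libxml version matters because the XML extensions are built on it.
echo 'libxml:      ',
    defined('LIBXML_DOTTED_VERSION') ? LIBXML_DOTTED_VERSION : 'n/a', "\n";
```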


I am using DOMDocument now for parsing and it works like a charm.
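
The reporter's DOMDocument code is not shown in the thread. A minimal sketch of that approach could look like the following, where the inline sample document and the Dublin Core namespace URI are assumptions chosen for illustration and not taken from the reporter's file:

```php
<?php
// Made-up sample; "\xC3\xA4" is the UTF-8 byte sequence for "ä".
// The namespace URI is an assumption, not taken from the reporter's file.
$dcNamespace = 'http://purl.org/dc/elements/1.1/';
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<record xmlns:dc="' . $dcNamespace . '">'
    . "<dc:publisher>Societ\xC3\xA4ts-Verlag</dc:publisher>"
    . '</record>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("Could not parse XML.\n");
}

// Look the element up by namespace URI and local name, not by prefix.
$nodes = $doc->getElementsByTagNameNS($dcNamespace, 'publisher');
$publisher = $nodes->item(0)->textContent;  // the element's whole text as one UTF-8 string

echo $publisher, "\n";
```

Unlike the callback-based parser, DOMDocument builds the whole tree first, so the text of an element is read back as a single string and there are no partial chunks to reassemble.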
 [2009-11-11 22:48 UTC] jani@php.net
Reopen if you can reproduce with something more recent.
 