[2009-11-10 17:59 UTC] gros at mpdl dot mpg dot de
Description:
------------
When parsing an xml file with UTF-8 encoding (like this one: http://bit.ly/3PSi44), text containing German umlauts is cut off:

original:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>
result after parsing:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>
results in
"äts-Verlag"

Reproduce code:
---------------
$snippet = file_get_contents("http://bit.ly/3PSi44");
if (!($xml_parser = xml_parser_create("")))
    die("Couldn't create parser.");
xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");
$retstr = "";
if (!xml_parse($xml_parser, $snippet)) {
    $retstr = sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($xml_parser)),
        xml_get_current_line_number($xml_parser));
}
xml_parser_free($xml_parser);

Expected result:
----------------
I expect properly imported text, as outlined in the description:

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>
should result in:
"Kaiser Wilhelm Institut für Züchtungsforschung"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>
should result in
"Societäts-Verlag"

Actual result:
--------------
I get cut-off pieces of text when the text contains German umlauts (see the two examples in the description).

parsing this:
<e:organization-name>Kaiser Wilhelm Institut für Züchtungsforschung</e:organization-name>
results in:
"Kaiser Wilhelm Institut f"

or parsing this
<dc:publisher>Societäts-Verlag</dc:publisher>
results in
"äts-Verlag"
Copyright © 2001-2025 The PHP Group. All rights reserved.
Last updated: Sun Oct 26 08:00:02 2025 UTC
Thanks, but the file does declare its encoding, both in the header (application/xml) and in the file itself:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Also, using $xml_parser = xml_parser_create("UTF-8"); does not help!
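
The report does not show characterDataHandler, so the cause cannot be confirmed from it. One thing worth ruling out: PHP's XML parser may deliver the text of a single element to the character data handler in several chunks, so a handler that assigns instead of appends keeps only one piece. The sketch below is an assumption about how such handlers could be written, not the reporter's code; collectElementText is a hypothetical helper name, and it only handles leaf elements, not mixed content.

```php
<?php
// Hypothetical helper (not from the report): returns [element name, text]
// pairs, concatenating every character-data chunk seen inside an element.
function collectElementText(string $xml): array
{
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $buffer = '';  // text accumulated for the element currently open
    $result = [];  // collected [name, text] pairs

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$buffer) {
            $buffer = '';                 // a new element starts empty
        },
        function ($p, $name) use (&$buffer, &$result) {
            $result[] = [$name, $buffer]; // element closed: text is complete
            $buffer = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$buffer) {
            $buffer .= $data;             // append, never overwrite
        }
    );

    // Third argument true: this is the final (and only) piece of input.
    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(sprintf(
            'XML error: %s at line %d',
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        ));
    }
    if (PHP_VERSION_ID < 80000) {
        xml_parser_free($parser);         // only needed before PHP 8
    }
    return $result;
}

// Usage with a simplified, prefix-free version of the reported element:
$pairs = collectElementText(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<publisher>Societäts-Verlag</publisher>'
);
echo $pairs[0][1], "\n";
```

If the full text comes through with handlers written this way, the truncation lies in how the original handler stores its data rather than in the encoding settings.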