PHP :: Bug #46129 :: Apostrophe character code (&#8217;) converted to garbage by SimpleXML or LibXML

Bug #46129	Apostrophe character code (&#8217;) converted to garbage by SimpleXML or LibXML
Submitted:	2008-09-19 19:26 UTC	Modified:	2008-09-23 20:45 UTC
From:	brett at brettbrewer dot com	Assigned:
Status:	Not a bug	Package:	SimpleXML related
PHP Version:	5.2.6	OS:	Linux xq41.cyberlnc.com 2.6.18-5
Private report:	No	CVE-ID:	None

[2008-09-19 19:26 UTC] brett at brettbrewer dot com

Description:
------------
When parsing an XML feed (WordPress) containing the character code for a right single curly quote (&#8217;), the character is converted into â€™. Unfortunately I'm not able to get complete access to the server to deactivate Zend Optimizer, ionCube, etc., and I'm pulling the OS info from phpinfo(). I've included the URL of the actual feed that is causing the problems. I found a really old similar bug report for PHP 4.3.2, but nothing for PHP 5. Here's the old bug report URL:

http://bugs.php.net/bug.php?id=24863&edit=2
I also found:
http://bugs.php.net/bug.php?id=26964&edit=2

which suggests a similar problem with htmlentities and html_entity_decode, but I don't know if it's related. I'm sure my feed is UTF-8, and if I convert it to ISO-8859-1 before passing it to my SimpleXML object then SimpleXML complains that it's not in UTF-8 format and aborts, so I'm pretty sure it's not a UTF-8 encoding issue with the feed. I've included the feed URL in the code sample below. It assumes it is inside a class, but you can probably run the code below to reproduce the symptoms just by removing the "this->" in two places.

Reproduce code:
---------------
$this->blog_url = "http://75.126.106.225/blog/feed/";
$rawFeed = file_get_contents($this->blog_url);
$xml = new SimpleXmlElement($rawFeed); 

//you can see the results of the incorrect parsing of the feed in the left sidebar at http://75.126.106.225

Expected result:
----------------
Code should keep the &#8217; entity code intact or possibly convert it to &apos;
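
If the goal is to have the numeric reference in the output rather than a raw UTF-8 character, one option is to re-encode the non-ASCII characters after parsing. The sketch below assumes the mbstring extension is loaded. The inline feed and the variable names are hypothetical stand-ins for the real feed, and this is only one possible approach, not something SimpleXML does on its own.

```php
<?php
// Hypothetical one-item feed standing in for the real WordPress feed.
$rawFeed = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<rss><channel><item><title>It&#8217;s a test</title></item></channel></rss>';

$xml = new SimpleXMLElement($rawFeed);

// SimpleXML returns the title as a UTF-8 string with the reference resolved.
$title = (string) $xml->channel->item->title;

// Convert every code point from U+0080 upward back into a decimal reference.
// Map format: start, end, offset, mask.
$convmap = array(0x80, 0x10FFFF, 0, 0xFFFFFF);
$escaped = mb_encode_numericentity($title, $convmap, 'UTF-8');

echo $escaped, "\n"; // It&#8217;s a test
```

The result is plain ASCII, so it displays the same way whichever charset the page is served with.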

Actual result:
--------------
The SimpleXML constructor seems to convert all instances of &#8217; into â€™

If you use SimpleXML to parse the feed at http://75.126.106.225/blog/feed/ you should see the problem in the <title> of the second item in the feed.
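
The symptom can be reproduced without the live feed. The sketch below parses a hypothetical one-item feed, shows that the parser resolves &#8217; to the three UTF-8 bytes E2 80 99, and then simulates a viewer that reads those bytes as Windows-1252 (which is how browsers generally treat a page labelled ISO-8859-1). It assumes the mbstring extension is available. The feed content and variable names are made up for illustration.

```php
<?php
// Hypothetical one-item feed standing in for the real WordPress feed.
$rawFeed = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<rss><channel><item><title>It&#8217;s a test</title></item></channel></rss>';

$xml   = new SimpleXMLElement($rawFeed);
$title = (string) $xml->channel->item->title;

// "It" occupies bytes 0-1, so the quote is the three bytes that follow.
$quoteHex = bin2hex(substr($title, 2, 3));
echo $quoteHex, "\n"; // e28099, the UTF-8 encoding of U+2019

// Simulate a viewer that wrongly treats those UTF-8 bytes as Windows-1252.
$misread = mb_convert_encoding($title, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // Itâ€™s a test
```

The parsed string is valid UTF-8 in both cases. Only the second interpretation of the same bytes produces the garbage characters.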

[2008-09-23 16:21 UTC] rrichards@php.net

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

It's UTF-8, so either convert the data to ISO-8859-1 or fix your HTML.

[2008-09-23 18:56 UTC] brett at brettbrewer dot com

Thanks very much for looking at this in spite of it being improperly submitted, but I'm a little confused by your reply. The feed in question is definitely UTF-8, and it's being passed to SimpleXML as UTF-8. I can echo the raw feed before passing it to a SimpleXML object and the XML contains the proper character code (&#8217;). Then I immediately feed that XML into a SimpleXML object and do a print_r, and the feed has been magically transformed: all instances of "&#8217;" have been replaced by "??s" before it ever hits an HTML page. Are you telling me that the "??s" is the proper UTF-8 encoded representation of &#8217;? The page itself has an XHTML Transitional doctype and a meta tag defining the charset as UTF-8, so I'm not sure what you're suggesting I fix in my HTML. I also don't see why I would convert anything to ISO-8859-1, since it doesn't appear that SimpleXML can parse a feed in ISO-8859-1 format. It's certainly possible I've overlooked something obvious here, but I've posted this question on the phpbuilder.com forums (http://www.phpbuilder.com/board/showthread.php?p=10887201) and at least one fairly expert poster there is stumped on this and agreed that it must be a bug. Otherwise I wouldn't have posted it as a bug report.

[2008-09-23 20:18 UTC] rrichards@php.net

No bug. Use a UTF-8 enabled terminal to view it properly. You have bad
HTML and are forcing the browser to use ISO-8859-1 (you even explicitly set
the charset to it), hence the garbage.

[2008-09-23 20:45 UTC] brett at brettbrewer dot com

This morning after your reply, I did discover a second duplicate meta declaration below the first (which was incorrectly set to ISO-8859-1), but it has since been removed and there is no difference in the behavior of the page. I now see nowhere in my HTML where there is any reference to ISO-8859-1. There is a UTF-8 meta declaration at the top of the page. I'm now wondering if this might actually be due to Apache HTTP headers overriding the encoding declared in the HTML file, because when I watch the headers via HTTPLiveHeaders, Apache is sending ISO-8859-1 as the encoding, but the page itself has UTF-8 defined in the HTML meta. So could this be caused by Apache's headers taking precedence over the encoding defined in the HTML? Thanks again for your replies.
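
On that last question: when the HTTP Content-Type header carries a charset, browsers generally use it in preference to a meta declaration in the markup, so a server default of ISO-8859-1 would be consistent with the symptom described. A minimal sketch of sending a matching header from PHP follows. It assumes the page is generated by PHP and that nothing has been output yet. `contentTypeHeader` is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: build a Content-Type header line for an HTML page.
function contentTypeHeader($charset)
{
    return 'Content-Type: text/html; charset=' . $charset;
}

$line = contentTypeHeader('UTF-8');

// header() must run before any output; skip it if output has already started.
if (!headers_sent()) {
    header($line);
}
```

If the pages are static, the equivalent change would have to be made in the Apache configuration instead, for example by checking whether an `AddDefaultCharset` directive is in effect. Whether that directive is what sets the header on this particular server is an assumption.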

Copyright © 2001-2025 The PHP Group All rights reserved.	Last updated: Wed Jul 02 12:01:36 2025 UTC