Bug #37513 Have news submission system perform HTML sanity checks, prevent RSS breakage
Submitted: 2006-05-18 21:07 UTC
Modified: 2010-12-20 12:12 UTC
From: edwardzyang at thewritingpot dot com
Assigned: bjori
Status: Closed
Package: Website problem
PHP Version: Irrelevant
OS:
Private report: No
CVE-ID: None
 [2006-05-18 21:07 UTC] edwardzyang at thewritingpot dot com
Description:
------------
The news items on PHP.net are often sloppily written, with improper encodings and unencoded ampersands. Perhaps you guys should run submitted HTML through HTML Tidy as a sanity check before accepting it. This is even more important for RSS, because one stray ampersand can break the whole feed.


 [2006-05-18 21:11 UTC] goba@php.net
We are open for patches.
 [2006-05-18 21:14 UTC] edwardzyang at thewritingpot dot com
I'm investigating the source. However, it seems the news items are inlined in index.php. Factoring them out would require major architectural changes.
 [2006-10-05 23:18 UTC] edwardzyang at thewritingpot dot com
What file is the RSS parser located in? index.php mentions it but doesn't give any other indication.
 [2006-11-06 23:11 UTC] edwardzyang at thewritingpot dot com
The real parser is http://cvs.php.net/viewvc.cgi/php-master-web/scripts/rss_parser?view=markup

I need to know a little more about the server setup before I write a patch. I could use Tidy (a very quick and easy fix), but it may not be installed on PHP.net's servers, in which case it is not a viable solution. If so, we'll probably have to bundle a library like HTML Purifier http://hp.jpsband.org/ to do the dirty work. If that seems like overkill, a quick:

$text = html_entity_decode($text, ENT_COMPAT, 'UTF-8');
$text = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');

...would work on PHP 5, but since that's not deployed, you'd be better off replacing the html_entity_decode() with:

// Decode '&amp;' last, so that input like "&amp;lt;" is not decoded twice.
$text = str_replace(array('&quot;', '&lt;', '&gt;', '&amp;'), array('"', '<', '>', '&'), $text);

...and hope no one uses a non-special entity. If you want that covered too, use HTML Purifier.
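
For illustration only, here is a minimal sketch of the decode-then-encode fallback wrapped in one function. The name normalize_entities() is hypothetical and not part of PHP.net's code, and the sketch assumes the field is plain text (such as a title), since any markup in it gets escaped as well. It uses strtr(), which replaces in a single pass, so an input like "&amp;lt;" is decoded only once:

```php
<?php
// Hypothetical helper, not taken from the php.net source tree.
// Normalizes the four special entities in a plain-text field so that
// &, ", < and > each end up escaped exactly once.
function normalize_entities($text)
{
    // strtr() with an array scans the string once and never rescans
    // its own output, so "&amp;lt;" becomes "&lt;" rather than "<".
    $decoded = strtr($text, array(
        '&amp;'  => '&',
        '&quot;' => '"',
        '&lt;'   => '<',
        '&gt;'   => '>',
    ));

    // Re-encode everything that needs escaping.
    return htmlspecialchars($decoded, ENT_COMPAT, 'UTF-8');
}

// A stray ampersand and an already-encoded one come out the same way.
echo normalize_entities('Q&A: PHP &amp; MySQL'), "\n";
```

As with the snippets above, non-special entities such as &copy; are not handled: their ampersand is simply escaped.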
 [2007-06-07 18:12 UTC] bjori@php.net
See http://php.net/feed.atom
 [2010-12-20 12:12 UTC] jani@php.net
-Package: Tidy +Package: Website problem
 