Bug #47210 Re: DOMDocument's inferior parsing of malformed HTML
Submitted: 2009-01-24 17:26 UTC
Modified: 2009-01-25 06:24 UTC
From: queen dot zeal at gmail dot com
Assigned:
Status: Not a bug
Package: DOM XML related
PHP Version: 5.2.8
OS:
Private report: No
CVE-ID: None
 [2009-01-24 17:26 UTC] queen dot zeal at gmail dot com
Description:
------------
Re: http://bugs.php.net/bug.php?id=47209

I don't agree with the closure of the bug report and since it would appear that I cannot make further posts since the report has been closed, I'm making a new one.

Anyway, the fact that the HTML in the above bug report is invalid doesn't mean that its processing can't be improved upon.  Refusing to even call this a bug is rather like dismissing a buffer overflow in unserialize() because the input is invalid.  In both cases, the processing of invalid input needs improvement.  In this case, it's not a security issue, but that doesn't mean it's not an issue all the same.

Closing the bug report by saying, simply, that it's invalid HTML, and leaving it at that, is rather like the Firefox developers saying "it's invalid HTML" and leaving it at that.  Seriously, half the web contains invalid HTML, and browsers parse it, for the most part, without any problems.  Maybe you think Firefox should just refuse to render any HTML that's invalid?  If they did that, I can guarantee you that people would begin to abandon Firefox left and right and Internet Explorer would regain the market dominance it once had.


History

 [2009-01-24 17:29 UTC] queen dot zeal at gmail dot com
My suggestion: use Gecko instead of libxml.
 [2009-01-24 17:38 UTC] scottmac@php.net
As Rob said, it's invalid HTML, and handling it is not the goal of the extension; we're not trying to be a web browser here.

If you think we should change it, then we welcome your patches so we can do this.
 [2009-01-24 19:39 UTC] pajoye@php.net
hint of the day: tidy
 [2009-01-25 06:24 UTC] chregu@php.net
And to hopefully put a final nail into this coffin:

Both Firefox/Gecko (checked with Firebug) and tidy build the same DOM out of 
that tag soup as libxml does:

<div>
<form action=""><input type="text" name="a"></form>
</div>

and not what the original author wants.

But yes, tidy usually does a better job than libxml2, since it is 
built exactly for tidying up HTML.
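
For readers following the tidy hint, a minimal sketch of running tidy in front of DOMDocument might look like the following. The sample markup is hypothetical (it is not the markup from bug #47209), and the sketch assumes the tidy extension may not be loaded, in which case it falls back to the raw input. Tidy does not necessarily produce a different tree than libxml for a given piece of tag soup, so this is a pre-cleaning step rather than a guaranteed fix.

```php
<?php
// Hypothetical malformed input, NOT the markup from bug #47209:
// the </div> and </form> end tags are in the wrong order.
$html = '<div><form action=""><input type="text" name="a"></div></form>';

// Pre-clean with tidy when the extension is available; otherwise keep the raw input.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($html, ['show-body-only' => true], 'utf8');
    if ($repaired !== false) {
        $html = $repaired;
    }
}

// Collect libxml parse errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected errors and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Inspect the tree that was actually built.
$inputs = $doc->getElementsByTagName('input');
echo $doc->saveHTML();
```

Printing the result of saveHTML() is the quickest way to see which tree the parser settled on before deciding whether the pre-cleaning step helps for a particular input.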


 