Bug #41339 DomDocument->loadHTML eats HTML without error with multiple meta information
Submitted: 2007-05-09 15:58 UTC
Modified: 2007-05-16 11:46 UTC
Votes: 2
Avg. Score: 4.5 ± 0.5
Reproduced: 2 of 2 (100.0%)
Same Version: 2 (100.0%)
Same OS: 2 (100.0%)
From: rasch at raschnet dot com
Assigned:
Status: Not a bug
Package: DOM XML related
PHP Version: 5.2.2
OS: Ubuntu Linux
Private report: No
CVE-ID: None
 [2007-05-09 15:58 UTC] rasch at raschnet dot com
Description:
------------
While using symfony, our code mistakenly produced a meta tag with two content types. As far as I understand, that is not invalid, but either way, if PHP fails on it, the DOM parser should return an error. The current behavior is that PHP returns an empty string when calling '$dom->saveHTML()' in the code sample below.



Reproduce code:
---------------
<?php
$dom = new DOMDocument("1.0", "utf-8");
// The meta tag below carries the content-type value twice.
$val = $dom->loadHTML('
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8, text/html; charset=utf-8">
</head>
<body>Hello</body></html>');
var_dump($val);
print $dom->saveHTML();
print "\n^^^ empty string\n";

Expected result:
----------------
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8, text/html; charset=utf-8">
</head>
<body><p>Hello</p></body></html>

Actual result:
--------------
bool(true)

// ^^^ empty string

History

 [2007-05-09 23:40 UTC] iliaa@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

The parser provided by libXML is not an HTML tag validator; it only
cares whether the syntax of the tags is valid.
 [2007-05-11 01:00 UTC] rasch at raschnet dot com
If you can, please take another look at this.


I think parsing the HTML would be above and beyond the bug here. In fact, the parser _is_ parsing some of the HTML to get the charset out of the content-type meta tag. Unfortunately, it seems that if the content-type isn't in the expected format, it returns nothing: not the ill-formed HTML, but nothing at all.

If one alters the content-type meta tag to include just one content-type value, it will happily return the HTML.
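
The alteration described above could be applied before the markup reaches loadHTML(). The following is only a sketch: keepFirstContentType() is a hypothetical helper, not part of PHP or symfony, and it assumes that http-equiv appears before content inside the meta tag and that both attributes are quoted.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a PHP or symfony API).
// Keeps only the first comma-separated value in the content attribute of a
// Content-Type meta tag. Assumes http-equiv precedes content and both are quoted.
function keepFirstContentType(string $html): string
{
    $pattern = '/(<meta\b[^>]*\bhttp-equiv=(["\'])Content-Type\2[^>]*\bcontent=(["\']))([^"\']*)(\3)/i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // $m[4] is the raw content value; keep the part before the first comma.
        $first = trim(explode(',', $m[4])[0]);
        return $m[1] . $first . $m[5];
    }, $html);

    // preg_replace_callback() returns null on a regex engine error;
    // fall back to the untouched input in that case.
    return $result ?? $html;
}

$broken = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8, text/html; charset=utf-8">';
echo keepFirstContentType($broken), "\n";
```

The cleaned string can then be handed to $dom->loadHTML() as in the reproduce code. A regular expression is used here only because the DOM parser itself is what stumbles on the tag; it is not a general-purpose HTML rewriter.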
 [2007-05-15 02:04 UTC] rasch at raschnet dot com
I've decided to have one more go at this bug submission.  As a bit of evidence for this bug's validity, I offer that the HTML which causes the DOMDocument class to return no results does in fact validate in the W3C validator.

Either way, DOMDocument->saveHTML should not return an empty string.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8, text/html; charset=utf-8"><title>Foo</title></head>
<body><p>Hello</p></body>
</html>
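
Until the behavior changes, a caller can at least make the empty result visible instead of silently printing nothing. This is a minimal sketch: requireNonEmptyHtml() is a hypothetical guard, not something DOMDocument provides, and the sample document in the usage part is a well-formed one chosen for illustration.

```php
<?php
// Hypothetical guard (an assumption for illustration): turns a silently
// empty serialization into an exception the caller can handle.
function requireNonEmptyHtml($html): string
{
    if ($html === false || trim((string) $html) === '') {
        throw new RuntimeException('saveHTML() produced no output');
    }
    return $html;
}

// Usage, only if the DOM extension is available in this build.
if (class_exists('DOMDocument')) {
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTML('<html><head><title>Foo</title></head><body><p>Hello</p></body></html>');
    echo requireNonEmptyHtml($dom->saveHTML());
}
```

Whether the doubled content-type still triggers the empty output depends on the libxml2 version in use, so the guard checks the result rather than the input.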

Thanks!
David
 [2007-05-16 11:46 UTC] rrichards@php.net
Sorry, but your problem does not imply a bug in PHP itself.  For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions.  Due to the volume
of reports we can not explain in detail here why your report is not
a bug.  The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

Not a PHP issue: libxml2 is unable to determine the encoding to use for output because of the bogus charset.
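
To see why the charset counts as bogus, it can help to look at what is left after "charset=" in the reported content value. The sketch below rests on an assumption, not on confirmed libxml2 internals: that the parser treats everything after the first "charset=" as the encoding name. Both function names are hypothetical. The validity check uses the EncName production from the XML 1.0 specification (a letter followed by letters, digits, '.', '_' or '-').

```php
<?php
// Assumption for illustration: everything after the first "charset=" in the
// content attribute is taken as the encoding name.
function charsetCandidate(string $content): ?string
{
    $pos = stripos($content, 'charset=');
    if ($pos === false) {
        return null;
    }
    return trim(substr($content, $pos + strlen('charset=')));
}

// EncName from XML 1.0: [A-Za-z] ([A-Za-z0-9._] | '-')*
function looksLikeEncodingName(string $name): bool
{
    return preg_match('/^[A-Za-z][A-Za-z0-9._-]*\z/', $name) === 1;
}

$doubled = 'text/html; charset=utf-8, text/html; charset=utf-8';
$single  = 'text/html; charset=utf-8';

var_dump(charsetCandidate($doubled));                          // the whole tail, comma included
var_dump(looksLikeEncodingName(charsetCandidate($doubled)));   // false
var_dump(looksLikeEncodingName(charsetCandidate($single)));    // true
```

Under that assumption the doubled value yields "utf-8, text/html; charset=utf-8", which is not a usable encoding name, while the single value yields plain "utf-8".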
 