Bug #70450 DOM::loadHTML() --> ::saveXML() entity encodes CR
Submitted: 2015-09-07 14:00 UTC
Modified: 2021-03-12 18:51 UTC
From: flavio dot cambraia at yahoo dot com dot br
Assigned: cmb
Status: Wont fix
Package: DOM XML related
PHP Version: 5.5.29
OS: Windows
Private report: No
CVE-ID: None


 [2015-09-07 14:00 UTC] flavio dot cambraia at yahoo dot com dot br
The output shows an entity &#13; for every line break in the $contents variable.
I am using PHP 5.5.8, build date Jan 8 2014 15:26:26.

Test script:
<?php
// The entity only shows up when this file is saved with CRLF line endings.
$contents = '<!DOCTYPE html>
<html lang="pt"><head><meta charset="utf-8"></head>
<div class="div_entry">
<div class="div_imagem">
<a href="test.html">	<img src="test.png" alt="" />Link to imagem</a>
<div class="div_product">This is a title for a product</div>
<div class="div_price">R$50,00</div>
</div>
</div>
</html>';

$doc = new DOMDocument();
// The prefix tells the HTML parser to treat the input as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $contents);

$xpath    = new DOMXPath($doc);
$xquery   = '//div[@class="div_entry"]';
$articles = $xpath->query($xquery);

$registros = array();
foreach ($articles as $article) {
    // Serialize each matching node as XML; this is where CR gets escaped.
    $registros[] = $article->ownerDocument->saveXML($article);
}

echo "<pre>"; print_r($registros); echo "</pre>";


 [2015-09-07 14:24 UTC]
-Summary: DOMXpath adds &#13; to string
+Summary: DOM::loadHTML() --> ::saveXML() entity encodes CR
 [2015-09-07 14:24 UTC]
This has nothing to do with XPath. Rewriting the code to use
"classic" DOM methods yields the same result.
 [2021-03-12 18:51 UTC]
-Status: Open
+Status: Wont fix
-Assigned To:
+Assigned To: cmb
 [2021-03-12 18:51 UTC]
The culprit is reading HTML and writing as XML.  libxml2 escapes
CR when writing XML because normally CRLF is folded to LF when
reading XML (required for conforming parsers).  As such, this
behavior is not wrong per se.  I consider the presented use-case
as quite uncommon, and I don't think this should be worked around
in PHP, especially since you can avoid the problem by not using
CRLF as line endings or, if you have no control over the
documents, by folding the CRLF manually (e.g. by using
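
The comment above is cut off before it names a function. As a minimal sketch of folding CRLF manually, the snippet below assumes str_replace() is an acceptable way to do it; that choice, the sample markup, and the helper name firstDivAsXml() are illustrative assumptions and do not come from the report. It serializes the same markup twice, once unchanged and once with CRLF folded to LF before parsing.

```php
<?php
// Hypothetical helper (not part of PHP or of the report): parse an HTML
// string and return the first <div> serialized as XML.
function firstDivAsXml(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $div = $doc->getElementsByTagName('div')->item(0);
    return $doc->saveXML($div);
}

// Illustrative input; "\r\n" forces CRLF regardless of how this file is saved.
$contents = "<html><body><div class=\"div_entry\">line one\r\nline two</div></body></html>";

// Fold CRLF to LF before parsing so that no CR reaches the DOM tree.
$folded = str_replace("\r\n", "\n", $contents);

// Unfolded input: this is the case the report describes, where CR is escaped.
echo firstDivAsXml($contents), "\n";

// Folded input: no CR is left in the text node, so nothing needs escaping.
echo firstDivAsXml($folded), "\n";
```

This only handles CRLF pairs. A document containing lone CR characters would need an extra replacement step, which is left out here.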