Bug #70450 DOM::loadHTML() --> ::saveXML() entity encodes CR
Submitted: 2015-09-07 14:00 UTC Modified: 2021-03-12 18:51 UTC
From: flavio dot cambraia at yahoo dot com dot br Assigned: cmb (profile)
Status: Wont fix Package: DOM XML related
PHP Version: 5.5.29 OS: Windows
Private report: No CVE-ID: None

 [2015-09-07 14:00 UTC] flavio dot cambraia at yahoo dot com dot br
The output shows an entity &#13; for every line break in the $contents variable.
I am using PHP 5.5.8 build date Jan 8 2014 15:26:26

Test script:
$contents = '<!DOCTYPE html>
<html lang="pt"><head><meta charset="utf-8"></head>
<div class="div_entry">
<div class="div_imagem">
<a href="test.html">	<img src="test.png" alt="" />Link to imagem</a>
</div>
<div class="div_product">This is a title for a product</div>
<div class="div_price">R$50,00</div>
</div>
</html>';
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $contents);
$xpath = new DOMXPath($doc);
$xquery = '//div[@class="div_entry"]';
$articles = $xpath->query($xquery);
$registros = array();
foreach ($articles as $i => $article) {
    $registros[] = $article->ownerDocument->saveXML($article);
}
echo "<pre>";
print_r($registros);
echo "</pre>";


 [2015-09-07 14:24 UTC]
-Summary: DOMXpath adds &#13; to string
+Summary: DOM::loadHTML() --> ::saveXML() entity encodes CR
 [2015-09-07 14:24 UTC]
This has nothing to do with XPath. Rewriting the code to use
"classic" DOM methods yields the same result.
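A minimal sketch of such a rewrite, assuming getElementsByTagName() and getAttribute() are the "classic" methods meant here. The shortened markup and the explicit CRLF line endings are illustrative assumptions, not the reporter's exact input. On the libxml2 behavior described in this report, the serialized fragment should still contain &#13;.

```php
<?php
// Build the input with explicit CRLF line endings (assumption: the
// reporter's file used Windows line endings, which triggers the issue).
$contents = implode("\r\n", array(
    '<!DOCTYPE html>',
    '<html lang="pt"><head><meta charset="utf-8"></head><body>',
    '<div class="div_entry">',
    '<div class="div_price">R$50,00</div>',
    '</div>',
    '</body></html>',
));

$doc = new DOMDocument();
$doc->loadHTML($contents);

// No XPath involved: walk all <div> elements and filter by class attribute.
$registros = array();
foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') === 'div_entry') {
        // Serializing the HTML-parsed node as XML is the step in question.
        $registros[] = $doc->saveXML($div);
    }
}

print_r($registros);
```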
 [2021-03-12 18:51 UTC]
-Status: Open
+Status: Wont fix
-Assigned To:
+Assigned To: cmb
 [2021-03-12 18:51 UTC]
The culprit is reading HTML and writing as XML.  libxml2 escapes
CR when writing XML because normally CRLF is folded to LF when
reading XML (required for conforming parsers).  As such, this
behavior is not wrong per se.  I consider the presented use-case
as quite uncommon, and I don't think this should be worked around
in PHP, especially since you can avoid it by not using CRLF line
endings or, if you have no control over the documents, you can
still fold the CRLF manually (e.g. by using
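One way to fold the line endings by hand is sketched below. The comment above is cut off before it names a function, so the use of str_replace() and the helper name normalizeLineEndings() are assumptions made for illustration, as is the shortened markup. The idea is simply to convert CRLF (and any lone CR) to LF before the markup reaches loadHTML().

```php
<?php
// Hypothetical helper, not part of the DOM extension: fold CRLF and
// lone CR to LF. The array is processed in order, so "\r\n" is
// replaced first and any remaining "\r" afterwards.
function normalizeLineEndings($html)
{
    return str_replace(array("\r\n", "\r"), "\n", $html);
}

// Illustrative input with explicit CRLF line endings.
$contents = implode("\r\n", array(
    '<!DOCTYPE html>',
    '<html lang="pt"><head><meta charset="utf-8"></head><body>',
    '<div class="div_entry">',
    '<div class="div_price">R$50,00</div>',
    '</div>',
    '</body></html>',
));

$doc = new DOMDocument();
$doc->loadHTML(normalizeLineEndings($contents));

$xpath = new DOMXPath($doc);
$registros = array();
foreach ($xpath->query('//div[@class="div_entry"]') as $article) {
    // With no CR left in the text nodes, there is nothing to escape.
    $registros[] = $doc->saveXML($article);
}

print_r($registros);
```

Normalizing before parsing keeps the fix in one place. Stripping &#13; from the serialized output afterwards would also work, but it would have to be repeated at every saveXML() call.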