Bug #70450 DOM::loadHTML() --> ::saveXML() entity encodes CR
Submitted: 2015-09-07 14:00 UTC
Modified: 2021-03-12 18:51 UTC
From: flavio dot cambraia at yahoo dot com dot br
Assigned: cmb
Status: Wont fix
Package: DOM XML related
PHP Version: 5.5.29
OS: Windows
Private report: No
CVE-ID: None

 [2015-09-07 14:00 UTC] flavio dot cambraia at yahoo dot com dot br
Description:
------------
The output shows an entity &#13; for every line break in $contents var.
I am using PHP 5.5.8 build date Jan 8 2014 15:26:26


Test script:
---------------
<?php
$contents = '<!DOCTYPE html>
<html lang="pt"><head><meta charset="utf-8"></head>
<body>
<div class="div_entry">
<div class="div_imagem">
<a href="test.html">	<img src="test.png" alt="" />Link to imagem</a>
</div>
<div class="div_product">This is a title for a product</div>
<div class="div_price">R$50,00</div>
</div>';
$doc     = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $contents);
$xpath         = new DOMXpath($doc);
$xquery = '//div[@class="div_entry"]';
$articles  = $xpath->query($xquery);
$registros = array();
foreach ($articles as $i => $article) {
    $registros[] = $article->ownerDocument->saveXML($article);
} // end foreach
echo "<pre>"; print_r($registros); echo "</pre>";
?>


History

 [2015-09-07 14:24 UTC] cmb@php.net
-Summary: DOMXpath adds &#13; to string
+Summary: DOM::loadHTML() --> ::saveXML() entity encodes CR
 [2015-09-07 14:24 UTC] cmb@php.net
This has nothing to do with XPath. Rewriting the code to use
"classic" DOM methods yields the same result.
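
For illustration, a minimal sketch of such a rewrite without XPath, selecting the element via getElementsByTagName(). The shortened $contents string with explicit "\r\n" line endings is an assumption standing in for the reporter's CRLF-encoded script; on affected setups each CR is expected to come back as &#13; here as well.

```php
<?php
// Hypothetical, shortened input with explicit CRLF line endings.
$contents = "<div class=\"div_entry\">\r\n<div class=\"div_price\">R\$50,00</div>\r\n</div>";

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $contents);

$registros = array();
// Plain DOM traversal instead of DOMXPath::query().
foreach ($doc->getElementsByTagName('div') as $div) {
    if ($div->getAttribute('class') === 'div_entry') {
        // Serializing an HTML-parsed node as XML is the step that escapes CR.
        $registros[] = $doc->saveXML($div);
    }
}

print_r($registros);
```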
 [2021-03-12 18:51 UTC] cmb@php.net
-Status: Open
+Status: Wont fix
-Assigned To:
+Assigned To: cmb
 [2021-03-12 18:51 UTC] cmb@php.net
The culprit is reading HTML and writing as XML.  libxml2 escapes
CR when writing XML because normally CRLF is folded to LF when
reading XML (required for conforming parsers).  As such, this
behavior is not wrong per se.  I consider the presented use-case
as quite uncommon, and I don't think this should be worked around
in PHP, especially since you can work around this by not having
CRLF as line endings or, if you have no control over the
documents, by folding the CRLF manually (e.g. by using
str_replace()).
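
A minimal sketch of the str_replace() workaround, assuming the markup arrives as a string before it is parsed. The shortened $contents and the names $normalized and $fragments are illustrative and not part of the original report.

```php
<?php
// Hypothetical, shortened input with explicit CRLF line endings.
$contents = "<div class=\"div_entry\">\r\n<div class=\"div_price\">R\$50,00</div>\r\n</div>";

// Fold CRLF (and any remaining lone CR) to LF before parsing,
// so no CR characters are left for saveXML() to escape.
$normalized = str_replace(array("\r\n", "\r"), "\n", $contents);

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $normalized);

$xpath = new DOMXPath($doc);
$fragments = array();
foreach ($xpath->query('//div[@class="div_entry"]') as $node) {
    $fragments[] = $doc->saveXML($node);
}

print_r($fragments);
```

The order of the search array matters: "\r\n" is replaced first, so a CRLF pair becomes a single LF rather than two.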
 
PHP Copyright © 2001-2021 The PHP Group
All rights reserved.
Last updated: Thu May 06 22:02:20 2021 UTC