Bug #63430 xml data parsing bug
Submitted: 2012-11-03 17:23 UTC Modified: 2012-11-21 14:30 UTC
From: lussenburg_rm at hotmail dot com Assigned:
Status: Not a bug Package: XML Reader
PHP Version: Irrelevant OS: windows 7
Private report: No CVE-ID: None
 [2012-11-03 17:23 UTC] lussenburg_rm at hotmail dot com
Description:
------------
---
From manual page: http://www.php.net/xmlreader.read#refsect1-xmlreader.read-description
---
The bug isn't really in the code, so I'm not including a script here, but it is related to the XML input. For example, I'm reading some RSS feeds (note that I neither compose them nor am responsible for their layout) that look like this:

<rss>
 <channel>
  <title>feed title</title>
  <description>feed description</description>
  <pubDate>Mon, 29 Oct 2012 13:30:00 +0100</pubDate>
  <item>
    <title>item title</title>
    <description>item description</description>
    <link>http://itemlink</link>
  </item>
  <item>
    <title>item title</title>
    <description>item description</description>
    <link>http://bla</link>
  </item>
  ...
 </channel>
</rss>

Everything was working perfectly fine until I kept getting values from the first 'item title' and 'item description' in the 'feed title' and 'feed description' node values. When I examined the XML data, I found that it only happens when the first <item> tag directly follows the last of the <channel> nodes (<title>, <description>, <pubDate> etc.) without a carriage return/newline.
To work around this, before passing the data to XMLReader::xml(), I replace all occurrences of "><item>" with ">\r\n<item>", which works fine, but maybe it could be resolved so this workaround isn't necessary anymore.
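
A minimal sketch of the workaround described above, assuming the feed has already been loaded into a string. The helper name separateItems and the sample feed are hypothetical and only for illustration. Note that the replacement string is double-quoted, since PHP only turns \r\n into a carriage return and newline inside double quotes.

```php
<?php
// Hypothetical helper illustrating the workaround: put a line break
// between any closing ">" and an immediately following <item> tag.
function separateItems($xml)
{
    // Double quotes are needed so that \r\n becomes CR LF rather than
    // the four literal characters backslash, r, backslash, n.
    return str_replace('><item>', ">\r\n<item>", $xml);
}

// Illustrative input (not taken from a real feed).
$feed  = '<channel><title>feed title</title><item><title>item title</title></item></channel>';
$fixed = separateItems($feed);
```

The result would then be passed to XMLReader::xml() in place of the raw feed contents.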



 [2012-11-07 19:50 UTC] mail+php at requinix dot net
Even if the input is "faulty" example code is still important. For all we know 
it's a complex problem you're triggering because of something subtle in your 
code.

I can't reproduce it with

<?php
$xml = <<<XML
<rss>
 <channel>
  <title>feed title</title>
  <description>feed description</description>
  <pubDate>Mon, 29 Oct 2012 13:30:00 +0100</pubDate><item>
    <title>item title</title>
    <description>item description</description>
    <link>itemlink</link>
  </item>
 </channel>
</rss>
XML;

$reader = new XMLReader();
$reader->xml($xml);

// http://www.php.net/manual/en/class.xmlreader.php#88264
function xml2assoc($xml) { removed for brevity }

print_r(xml2assoc($reader));
?>

PHP 5.4.3 and libxml 2.7.7
 [2012-11-20 20:30 UTC] lussenburg_rm at hotmail dot com
Hi there,

This code is for testing purposes, so I could learn how XMLReader() works before incorporating it in an RssWebfeed class I've written.
In this code the only thing I replace, to work around the bug I got, is the bit that is commented out in this example. 'nosnieuwsalgemeen.xml' is the file I have saved on my PC so I don't have to read it from the internet every time. It is the contents of http://feeds.nos.nl/nosnieuwsalgemeen. Another example is http://www.nasa.gov/rss/breaking_news.rss, but this one doesn't give the bug.
In the implementation, I need to get the data that comes before the first <item> into a feed database which identifies different feed IDs and their titles and descriptions. When I encounter the first <item>, the records from there on go into a second database which defines items for a particular feed.


Here's the code:


/*
$find = array (
	'<![CDATA[', ']]>', '><item>'
);
$repl = array (
	'',          '',    '>\r\n<item>'
);
*/

$file = 'nasa_breaking_news.xml';

$cont = file_get_contents($file);
//$cont = str_ireplace($find, $repl, $cont);

$nodes = array (
	'rss'            => array( 'version' => 'rss_version' ),
	'guid'           => true,
	'link'           => true,
	'title'          => true,
	'description'    => true,
	'pubDate'        => true,
	'lastBuildDate'  => true,
	'language'       => true,
	'image'          => true,
	'enclosure'      => array( 'url' => 'enclosure', 'type' => 'type', 'width' => 'imgwidth' ),
	'managingEditor' => true,
	'related'        => true,
);

$siblings = array (
	'image' => array( 'url' => 'image', 'title' => 'alt', 'link' => 'link', 'description' => 'title' ),
);

$xml = new XMLReader();

if ( $xml ) {
	echo '
	<div class="e large">xml = new XMLReader()</div>
	<div>gelukt</div>
	<br>';
}

if ( $xml->xml($cont, THIS_CHARSET, LIBXML_NOERROR|LIBXML_NOWARNING) === true ) {
	printf( '
	<div class="e large">xml->open()</div>
	<div>%s</div>
	<br>',
	$file
	);

	echo '
	<br>';

	$mode        = 0;
	$element     = '';
	$itemcount   = 0;

	while ( $xml->read() ) {

		if ( $xml->name == 'item' ) {
			switch ( $xml->nodeType ) {
			case XMLReader::ELEMENT:
				$itemcount++;
				$mode = 1;
				break;
			case XMLReader::END_ELEMENT:
				$mode = 0;
				break;
			}
		}

		$element = '';

		switch ( $xml->nodeType ) {
		case XMLReader::END_ELEMENT:
		case XMLReader::SIGNIFICANT_WHITESPACE:
		case XMLReader::WHITESPACE:
		case XMLReader::TEXT:
		case XMLReader::CDATA:
			continue 2;
		}

  		printf( '
		<br>
		<div style="padding-left:%uem;">
		<div class="e large">xml->read():</div>
		<div>xml->name: %s%s</div>
		<div>xml->nodeType: %d</div>
		<div>xml->isEmpty: %s</div>
		<div>xml->hasvalue: %s</div>
		<div>xml->attr: %s</div>
		<div>xml->depth: %d</div>',
		$mode+1,
		$xml->name,
		$xml->name=='item' ? sprintf(' (rec#: %u)', $itemcount) : '',
		$xml->nodeType,
		$xml->isEmptyElement ? "yes" : "no",
		$xml->hasValue ? "yes" : "no",
		$xml->hasAttributes ? $xml->attributeCount : "no",
		$xml->depth
		);

		if ( !$nodes[$xml->name] ) {
			echo '
			</div>';
			continue;
		}

		switch ( $xml->nodeType ) {
		case XMLReader::ELEMENT:
			$element = $xml->name;
			printf( '
			<div%s>',
			$nodes[$xml->name] ? ' class="grey"' : ''
			);
			if ( $nodes[$xml->name] === true ) {
				printf( '
				<div>INNER: %s</div>',
				$xml->readInnerXML()
				);
			}
			if ( $node = $xml->expand() ) {
				printf( '
				<div>node->name: %s</div>',
				$node->nodeName
				);
				printf( '
				<div>node->childs: %s</div>',
				$node->hasChildNodes() ? "".$node->childNodes->length : "no"
				);
				if ( $xml->hasAttributes && $node->attributes !== null ) {
					echo '
					<div>node->attr: ';
					for ( $i = 0; $i < $xml->attributeCount; $i++ ) {
						$item = $node->attributes->item($i);
						if ( $nodes[$xml->name][$item->nodeName] ) printf('[%s=%s]', $nodes[$xml->name][$item->nodeName], $item->nodeValue);
					}
					echo '
					</div>';
				}
				if ( $node->hasChildNodes() && $siblings[$node->nodeName] ) {
					echo '<div>node->items:';
					for ( $i = 0; $i < $node->childNodes->length - 1; $i++ ) {
						$item = $node->childNodes->item($i);
						if ( $item->nodeType == XMLReader::ELEMENT && $siblings[$node->nodeName][$item->nodeName]) {
							echo '['.$siblings[$node->nodeName][$item->nodeName].'='.$item->nodeValue.']';
						}
					}
					echo '</div>';
				}
				if ( $node->hasChildNodes() && ($mode == 1 || $siblings[$node->nodeName]) ) $xml->next();
			}
			echo '
			</div>';
			break;
		}
		echo '
		</div>';
	}

	$ret = $xml->close();

	printf( '
	<br>
	<div class="bordertop">
	<div class="e large">xml->close():</div>
	<div>%sgelukt</div>
	</div>',
	$ret===false ? 'niet ' : ''
	);

}
 [2012-11-20 21:44 UTC] mail+php at requinix dot net
Hate to burst your bubble but there's a flaw in your code. The problem occurs when
* There is a node before an <item> with no whitespace (i.e., a #text) in between
* Said node has children
* Said node has an entry in $siblings

The last two cause a line of code near the bottom

if ( $node->hasChildNodes() && ($mode == 1 || $siblings[$node->nodeName]) )
  $xml->next();

to fire. next() will skip over the rest of the node and, in lieu of a subsequent
#text, advance to the <item>. But at the top of your loop you have a read(). That
will skip over the tag and into the following #text (between the <item> and the
<title>). You can confirm this by outputting the node name at the beginning of the
loop - before the switch that would skip over it: <image>, then #text, then
<title>.
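
The effect can be reproduced in isolation. The following is a small sketch, not the reporter's code: the XML string and the helper name visitedWithPlainWhile are made up for illustration, and the call to next() on <image> stands in for the $siblings condition.

```php
<?php
// Illustrative input: no whitespace between </image> and <item>.
$xml = '<channel><image><url>u</url></image>'
     . '<item><title>t</title></item></channel>';

// Hypothetical helper: collect the names of the elements that a
// read()-driven while loop visits when it calls next() on <image>.
function visitedWithPlainWhile($xml)
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $seen = array();
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        $seen[] = $reader->name;
        if ($reader->name === 'image') {
            // Skips the <image> subtree. The cursor now sits on <item>,
            // and the read() in the loop condition moves past it to <title>.
            $reader->next();
        }
    }
    $reader->close();
    return $seen;
}

echo implode(', ', visitedWithPlainWhile($xml)), "\n";
// prints: channel, image, title
```

With a newline between </image> and <item>, next() stops on the whitespace node instead, the following read() lands on <item>, and the element is visited as expected.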

It works for me if I change the while loop into a do/while:
* $xml->read() before the loop to initialize
* flag=false at the start of the loop
* the aforementioned line sets flag=$xml->next()
* do/while ( flag || $xml->read() )
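
One possible reading of those four steps, as a sketch rather than the exact code that was tested: the XML string, the helper name visitedWithDoWhile and the variable $skipped (the "flag") are assumptions made for illustration.

```php
<?php
// Illustrative input: no whitespace between </image> and <item>.
$xml = '<channel><image><url>u</url></image>'
     . '<item><title>t</title></item></channel>';

// Hypothetical helper: same traversal as a plain while loop, but the
// loop only calls read() when next() has not already moved the cursor.
function visitedWithDoWhile($xml)
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $seen = array();
    if (!$reader->read()) {          // read() before the loop to initialize
        $reader->close();
        return $seen;
    }
    do {
        $skipped = false;            // flag = false at the start of the loop
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $seen[] = $reader->name;
            if ($reader->name === 'image') {
                // next() already positions the cursor on the following node,
                // so remember that and do not read() again this round.
                $skipped = $reader->next();
            }
        }
    } while ($skipped || $reader->read());
    $reader->close();
    return $seen;
}

echo implode(', ', visitedWithDoWhile($xml)), "\n";
// prints: channel, image, item, title
```

Because || short-circuits, read() is skipped whenever next() succeeded, so the node that next() landed on is processed by the loop body instead of being stepped over.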

If you'd like to know more you can email me at this address.
 [2012-11-21 11:32 UTC] lussenburg_rm at hotmail dot com
That does work indeed, thanks. I guess I misunderstood the explanation of next(). I didn't expect it to skip over the opening <tag> of a new element. I thought it would only skip over all subtrees of the current element, and that the read at the top of the loop would start at the <item> element.

Compliments on the 'super fast' reply as well!
 [2012-11-21 11:33 UTC] lussenburg_rm at hotmail dot com
-Status: Open +Status: Closed
 [2012-11-21 14:30 UTC] rasmus@php.net
-Status: Closed +Status: Not a bug
 