Bug #27242 XML parser returns some weird outputs when you handle large files (500 MB++)
Submitted: 2004-02-13 12:35 UTC Modified: 2004-02-14 11:20 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:0 of 0 (0.0%)
From: amix at amix dot dk Assigned:
Status: Not a bug Package: XML related
PHP Version: 4.3.5RC2 OS: Windows and Mac OS X
Private report: No CVE-ID: None
 [2004-02-13 12:35 UTC] amix at amix dot dk
Description:
------------
I have written a script to parse the DMOZ RDF XML files, which are huge (one is 500 MB and the other is 1.2 GB).

The major problem is that the XML parser produces garbage output when parsing those large files!

The problem seems to depend on how many bytes are read per call, i.e. this code:
      while ($data = fread($fp, 4096))

For now I have worked around the problem by loading the whole file into memory:
      while ($data = fread($fp, filesize($this->xml_file)))
      {

It takes a few minutes to load the 500 MB file, but I can't do that with the 1.2 GB file.

I have searched Google a great deal and looked at how others parse XML files. Everything I have seen uses fread (including some PEAR scripts).

I have also made an example that shows the problem on a smaller scale. If you set fread to read 2 bytes at a time, it produces weird output.

Reproduce code:
---------------
This example is taken from a book.
<?php 
$currentTag = ""; 

$fields = array(); 
$values = array(); 

$xml_file="data.xml"; 

function startElementHandler($parser, $name, $attributes) 
{
      global $currentTag, $table; 
      $currentTag = $name; 

      if (strtolower($currentTag) == "table") 
      {
            $table = $attributes["name"]; 
      } 

} 

function endElementHandler($parser, $name) 
{
      global $fields, $values, $count, $currentTag; 

      global $connection, $table; 

      if (strtolower($name) == "record") 
      {
            $query = "INSERT INTO $table"; 
            $query .= "(" . join(", ", $fields) . ")"; 
            $query .= " VALUES(\"" . join("\", \"", $values) . "\");"; 

          echo "$query\n";

            $fields = array(); 
            $values = array(); 
            $count = 0; 
            $currentTag = ""; 
      } 

} 

function characterDataHandler($parser, $data) 
{
      global $fields, $values, $currentTag, $count; 
      if (trim($data) != "") 
      {
            $fields[$count] = $currentTag; 

            $values[$count] = mysql_escape_string($data); 
            $count++; 
      } 
} 

$xml_parser = xml_parser_create(); 

xml_parser_set_option($xml_parser,XML_OPTION_SKIP_WHITE, TRUE); 


xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, FALSE); 

xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler"); 
xml_set_character_data_handler($xml_parser, "characterDataHandler"); 

if (!($fp = fopen($xml_file, "rb"))) 
{
      die("File I/O error: $xml_file"); 
} 

while ($data = fread($fp, 2)) 
{
      if (!xml_parse($xml_parser, $data, feof($fp))) 
      {
            $ec = xml_get_error_code($xml_parser); 
            die("XML parser error (error code " . $ec . "): " . xml_error_string($ec) . 
"<br>Error occurred at line " . xml_get_current_line_number($xml_parser)); 
      } 
} 

xml_parser_free($xml_parser); 


?> 

data.xml
<?xml version="1.0"?> 
<table name="readings"> 
      <record> 
            <a>56565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656</a> 
            <b>12565656565656565656565656565656565656565656565656565656565656565622</b> 
            <c>785656565656565656565656565656565656565656565656565656565656565656.5</c> 
      </record> 
      <record> 
            <x>456565656565656565656565656565656565656565656565656565656565656565</x> 
            <y>-565656565656565656565656565656565656565656565656565656565656565610</y> 
      </record> 
      <record> 
            <x>156565656565656565656565656565656565656565656565656565656565656562</x> 
            <b>105656565656565656565656565656565656565656565656565656565656565656459</b> 
            <a>7565656565656565656565656565656565656565656565656565656565656565656</a> 
            <y>95656565656565656565656565656565656565656565656565656565656565656</y> 
      </record> 
</table>


Expected result:
----------------
INSERT INTO readings(a, b, c) VALUES("56565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656565656", "12565656565656565656565656565656565656565656565656565656565656565622", "785656565656565656565656565656565656565656565656565656565656565656.5");
INSERT INTO readings(x, y) VALUES("456565656565656565656565656565656565656565656565656565656565656565", "-565656565656565656565656565656565656565656565656565656565656565610");
INSERT INTO readings(x, b, a, y) VALUES("156565656565656565656565656565656565656565656565656565656565656562", "105656565656565656565656565656565656565656565656565656565656565656459", "7565656565656565656565656565656565656565656565656565656565656565656", "95656565656565656565656565656565656565656565656565656565656565656");


Actual result:
--------------
INSERT INTO readings(a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c, c) VALUES("56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "12", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "22", "78", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", ".5"); INSERT INTO readings(x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y) VALUES("4", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "5", "-", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "10"); INSERT INTO readings(x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, x, b, b, b, b, b, b, b, 
b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, b, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, a, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y, y) VALUES("1", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "2", "1", "05", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "64", "59", "75", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "65", "6", "9", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56", "56");


 [2004-02-14 11:04 UTC] sniper@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

Of course you get strange results if you try to parse 2 bytes per time..

 [2004-02-14 11:20 UTC] amix at amix dot dk
"Of course you get strange results if you try to parse 2 
bytes per time.."

Please read the whole text. The problem isn't that I parse 2 
bytes at a time; my main problem is that the same 
problem happens when I parse large XML files!

If you parse huge files, you get the same output error, and 
it does not matter whether you read 2 bytes at a time or 100000 
(trust me, I have tried it).

How do I fix this problem without loading the whole 
file into memory? It seems to me that the problem is 
located in the PHP XML parser core.
 [2004-02-14 17:14 UTC] adam at trachtenberg dot com
Your character data handler receives data two bytes at a 
time, so you need to use ".=" instead of "=" when you 
do: $values[$count] = mysql_escape_string($data).
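
For illustration, here is a minimal sketch of that suggestion, written against a current PHP version rather than 4.3. In the posted script, switching to ".=" alone would probably not be enough, because $count is incremented on every call to the character data handler, so each fragment would still land in a new slot. The sketch therefore collects text in a buffer and only stores it when the element's end tag arrives. The names parseRecords() and readChunks() are hypothetical helpers introduced here, not part of the original script or of PHP. The sketch returns arrays instead of building SQL, since mysql_escape_string() no longer exists in PHP 7 and later.

```php
<?php
// Sketch only: parseRecords() and readChunks() are hypothetical helper names,
// not part of the original script or of PHP itself.

/** Yield a file's contents in fixed-size chunks so it is never held in memory whole. */
function readChunks(string $path, int $size = 4096): Generator
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("File I/O error: $path");
    }
    try {
        while (!feof($fp)) {
            $chunk = fread($fp, $size);
            if ($chunk === false) {
                break;
            }
            if ($chunk !== '') {
                yield $chunk;
            }
        }
    } finally {
        fclose($fp);
    }
}

/** Parse <table name="..."><record>...</record></table> from any sequence of string chunks. */
function parseRecords(iterable $chunks): array
{
    $state = new stdClass();
    $state->table  = '';
    $state->buffer = '';   // text collected for the element currently open
    $state->row    = [];   // field => value for the current <record>
    $state->rows   = [];   // finished records

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use ($state): void {
            if ($name === 'table') {
                $state->table = $attrs['name'] ?? '';
            }
            $state->buffer = '';   // every element starts with an empty buffer
        },
        function ($p, string $name) use ($state): void {
            if ($name === 'record') {
                $state->rows[] = $state->row;
                $state->row = [];
            } elseif (trim($state->buffer) !== '') {
                // The element is complete, so the buffer now holds its full text.
                $state->row[$name] = trim($state->buffer);
            }
            $state->buffer = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, string $data) use ($state): void {
            $state->buffer .= $data;   // append: the handler may fire several times per element
        }
    );

    $fail = function () use ($parser): void {
        $ec = xml_get_error_code($parser);
        throw new RuntimeException(sprintf(
            'XML parser error (error code %d): %s at line %d',
            $ec,
            (string) xml_error_string($ec),
            xml_get_current_line_number($parser)
        ));
    };

    foreach ($chunks as $chunk) {
        if (!xml_parse($parser, $chunk, false)) {
            $fail();
        }
    }
    if (!xml_parse($parser, '', true)) {   // tell the parser the input is finished
        $fail();
    }

    return ['table' => $state->table, 'rows' => $state->rows];
}
```

With the data.xml from the report, parseRecords(readChunks('data.xml', 2)) should give one array per <record> with each value in one piece, and the result should not depend on the chunk size. The same call with a 4096-byte chunk size is the natural choice for the large DMOZ files, since only one chunk is held in memory at a time.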
 