Bug #30887 XML parser stops in the data at the first character such as (?????? and so on ...)
Submitted: 2004-11-24 21:08 UTC Modified: 2004-11-30 07:22 UTC
Votes:1
Avg. Score:5.0 ± 0.0
Reproduced:0 of 1 (0.0%)
From: maddam at volny dot cz Assigned:
Status: Not a bug Package: *XML functions
PHP Version: 5.0.2 OS: Win XP
Private report: No CVE-ID: None
 [2004-11-24 21:08 UTC] maddam at volny dot cz
Description:
------------
The XML file contains this element:
<data>Jak se m?? holoub?tko ?</data>
The text is Czech and contains accented characters.

<?php
function characterData($parser, $data)
{
    $getdata = $data;
    echo $getdata; // should print 'Jak se m?? holoub?tko ?'
}

Instead, the parser stops early and $getdata contains only 'Jak se m'.
On the next call for the same element, the parser delivers the rest of the text and $getdata contains '?? holoub?tko ?'.

This bug appears in 5.0.2. In 4.3.9 and earlier everything is OK. I have not tested 5.0.0 and 5.0.1.

Description: When the parser receives data containing accented characters such as (?????????), it splits the data into two parts. The first part contains the characters up to the first occurrence of such a character (?????????), and the second part contains the rest of the element's text.

THIS BUG DOES NOT SHOW UP FOR ENGLISH TEXT, WHICH DOES NOT USE CHARACTERS SUCH AS ?????????

Sorry for my English; I hope you understand. Contact me at maddam@volny.cz or ICQ 25684007

Reproduce code:
---------------
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rokam [
	<!ELEMENT data (#PCDATA)>
]>
<rokam>
 <data>Jak se m??</data>
 <data>Zde doma je dobr? ocet</data>
</rokam>

<?php
function characterData($parser, $data)
{
    $getdata = $data;
    echo $getdata . '<br />';
}

Expected result:
----------------
Output on screen, which should need two calls to the function characterData:

Jak se m??
Zde doma je dobr? ocet

Actual result:
--------------
The output of the 5.0.2 parser needs four calls to the function characterData and is:


Jak se m
??
Zde doma je dobr
? ocet

-------------------------------------------------
This BUG can be worked around with the following code, which joins the two parts from the parser into one variable, say $data. This code
joins 'Jak se m' with '??'.

function characterData($parser, $data) {
        global $currentTag;

// <code for repair start>
        global $lastdata, $lastTag;
        if (strcmp($lastTag, $currentTag) == 0) {
            // second part for the same tag: join it to the stored first part
            $data = $lastdata . trim($data);
            $lastdata = $lastTag = '';
        } else {
            // first part: store it and wait for the next call
            $lastdata = $data;
            $lastTag = $currentTag;
            return;
        }
// <code for repair end>

        // the normal code for function characterData can continue here
}

Note that trim($data) must be there: the parser appends CR (0x0D) LF (0x0A) (I think) to the end of the first part's $data string, and it must be trimmed for the code to work properly.


 [2004-11-24 21:28 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

This is expected, and definitely not wrong. Nothing says that XML parsers can't break up CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just how an XML parser might handle it (and the new libxml2 we have in PHP 5 does it like this).
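
For readers hitting the same symptom, the usual pattern is to append every chunk handed to the character data handler to a buffer and to use the text only once the element's end tag arrives. The following is a minimal sketch under assumptions: the globals `$buffer` and `$results`, the handler names, and the accented sample text (stand-ins, since the report's original letters were lost) are illustrative and not taken from the report.

```php
<?php
// Hypothetical globals for this sketch; they are not part of the original report.
$buffer = '';
$results = [];

function startElement($parser, $name, $attrs)
{
    global $buffer;
    // A new element starts: begin with an empty buffer.
    // (This simple sketch ignores mixed content.)
    $buffer = '';
}

function characterData($parser, $data)
{
    global $buffer;
    // May be called several times for one text node, so append instead of assigning.
    $buffer .= $data;
}

function endElement($parser, $name)
{
    global $buffer, $results;
    // Element names are upper-cased by default (case folding is on).
    if ($name === 'DATA') {
        $results[] = $buffer; // the complete text of one <data> element
    }
    $buffer = '';
}

// UTF-8 sample input; the accented letters are assumed stand-ins.
$xml = '<rokam><data>Jak se máš</data><data>Zde doma je dobrý ocet</data></rokam>';

$parser = xml_parser_create();
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

foreach ($results as $text) {
    echo $text, "\n";
}
```

Because the text is read only in endElement, it does not matter how many times, or at which character, the parser splits it.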
 [2004-11-29 22:26 UTC] maddam at volny dot cz
Yes, from now on I will not rely on getting one event for each CDATA section in PHP 5 and above.

But please explain this behavior of the XML parser for these four CDATA sections:

 

function characterData($parser, $data) {
    echo $data . '<br />';
}

 

1.

------------------------------------------------------------------------------------------------------------

<data>This is expected, and definitely not wrong. There is never said that XML parsers can't break up 

CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just 

how an XML parser might handle it (and the new libxml2 we have in PHP 5 does it like this).</data>

 

The output is the same: the whole sentence arrives in one $data:

This is expected, and definitely not wrong. There is never said that XML parsers can't break up 

CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just 

how an XML parser might handle it (and the new libxml2 we have in PHP 5 does it like this).

 

2.

------------------------------------------------------------------------------------------------------------

<data>This is exp?cted, and definitely not wrong. There is never said that XML parsers can't break up 

CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just 

how an XML parser might handle it (and the new libxml2 we have in PHP 5 does it like this).</data>

 

Here I put the character '?' near the beginning of the CDATA (exp?cted).

The output is divided into:

 

$data first

This is exp

 

$data second

?cted, and definitely not wrong. There is never said that XML parsers can't break up CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just how an XML parser might handle it (and the new libxml2 we have in PHP 5 does it like this).

 

3.

------------------------------------------------------------------------------------------------------------

<data>This is expected, and definitely not wrong. There is never said that XML parsers can't break up 

CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just 

how an XML parser might handle it (and the new libxml2 we h?ve in PHP 5 does it like this).</data>

 

Here I put the character '?' near the end of the CDATA (h?ve).

The output is divided into:

 

$data first

This is expected, and definitely not wrong. There is never said that XML parsers can't break up CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just how an XML parser might handle it (and the new libxml2 we h

 

$data second

?ve in PHP 5 does it like this).

 

4.

------------------------------------------------------------------------------------------------------------

<data>This is exp?cted, and definitely not wrong. There is never said that XML parsers can't break up 

CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just 

how an XML parser might handle it (and the new libxml2 we h?ve in PHP 5 does it like this).</data>

 

Here I put the character '?' near the beginning and near the end of the CDATA (exp?cted and h?ve).

The output is divided into:

 

$data first

This is exp

 

$data second

?cted, and definitely not wrong. There is never said that XML parsers can't break up CDATA sections, and you should never rely on getting only one event for each CDATA section. This is just how an XML parser might handle it (and the new libxml2 we h?ve in PHP 5 does it like this).

 

------------------------------------------------------------------------------------------------------------

No matter how long the CDATA section is, where it gets divided depends strongly on the first occurrence of the character '?'; the parser does not touch the other occurrences of '?'. I think you are wrong and THIS IS A BIG BUG! PHP 4 is OK here. I think many people who used PHP 4 and move to PHP 5 will have big problems if they work in languages other than English, as I do. My internet provider always moves to the latest PHP version; he now runs 5.0.2 and cannot go back to a working PHP 4 for one person like me who uses XML with PHP. Instead, I must use correction code for PHP 5 that watches for the division of $data and joins the parts back into one $data.

THE SECOND BUG is that when the parser returns the first $data of a divided section, it adds CRLF at the end of that first $data; for point 4: (This is expCRLF). IS THAT NOT A BUG, when the parser adds characters that are not present in the CDATA section? I THINK THIS IS A BIG, BIG BUG!

 

When the parser returns $data 'I am last', strlen($data) = 9. But when $data is 'I am l?st' with '?' and you use, say, $getdata = $getdata . $data (to join the divided $data), you get strlen($getdata) = 11, and you have to ask: WHERE ARE the 11 characters in <data>I am l?st</data>? THIS BUG is a BIG problem when you want to search for a CDATA string. Say you are looking for the text 'I am l?st' in the XML file with the function strcmp:

 

if (strcmp($currentTag, "data") == 0) {
    $sentence = $data;
    if (strcmp($sentence, "I am l?st") == 0) {
        // you never get here: the parser delivers two $data values, (I am l) and (?st), and the joined $sentence has strlen 11, not 9
    }
}

 

but you never find it!!! I know what I am talking about; this is my own search problem. My code definitely stops working with PHP 5!
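
One likely explanation for the length mismatch, offered as an assumption rather than a confirmed diagnosis: as of PHP 5.0.2 the parser's default output encoding is UTF-8, so an accented letter that is one byte in an ISO-8859-1 file arrives as two bytes, and strlen() counts bytes. The sketch below assumes the garbled '?' stands for 'á'; the helper name `utf8Length` is hypothetical, and the report's exact figure of 11 is not reproduced here.

```php
<?php
// Assumption: the garbled '?' is one accented letter; 'á' is used for illustration.
$latin1 = "I am l\xE1st";     // ISO-8859-1: 'á' is the single byte 0xE1
$utf8   = "I am l\xC3\xA1st"; // UTF-8: 'á' is the two bytes 0xC3 0xA1

// Hypothetical helper: count characters, not bytes, in a UTF-8 string.
// The /u modifier makes PCRE treat the subject as UTF-8; /s lets '.' match newlines.
function utf8Length($s)
{
    return preg_match_all('/./us', $s);
}

echo strlen($latin1), "\n";   // byte length of the ISO-8859-1 form
echo strlen($utf8), "\n";     // byte length of the UTF-8 form (one byte longer)
echo utf8Length($utf8), "\n"; // character count of the UTF-8 form

// Comparing a UTF-8 string from the parser with an ISO-8859-1 literal fails;
// comparing it with a UTF-8 literal succeeds.
var_dump(strcmp($utf8, $latin1) === 0);
var_dump(strcmp($utf8, "I am l\xC3\xA1st") === 0);
```

If ISO-8859-1 output is preferred, calling `xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'ISO-8859-1')` before parsing asks the parser to hand back text in that encoding instead.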

That is why my correction code must handle joining a CDATA section divided into two parts and must also work around the second BUG by trimming all non-printable characters from $data. Why spend time developing correction code when PHP 4 works correctly? When you move from PHP 4 to PHP 5, all sites using PHP 4 and XML will definitely stop working (for languages with special characters, not for plain English). With PHP 5 the text will be cut at the first occurrence of a special language character, and search functions that use the XML parser will not be able to find any string they look for.

 

And you must read and consider the text above, not just give your conventional 'Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://www.php.net/manual/ and the instructions on how to report a bug at http://bugs.php.net/how-to-report.php'. Right?

 

5.

------------------------------------------------------------------------------------------------------------

Last question: why can't I parse the character & without getting an error, as I see in other bug reports?
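
On the ampersand: in XML a bare & starts an entity reference, so in character data it has to be written as &amp;. The sketch below is illustrative only; the helper name `parsesCleanly` and the sample text are assumptions, not from the report.

```php
<?php
// Hypothetical helper: true when ext/xml parses $xml without error.
function parsesCleanly($xml)
{
    $parser = xml_parser_create();
    return xml_parse($parser, $xml, true) === 1;
}

$raw = 'Tom & Jerry'; // illustrative text containing a bare ampersand

// Not well-formed: the parser expects an entity name after '&'.
$bad = '<data>' . $raw . '</data>';

// Well-formed: htmlspecialchars() turns '&' into '&amp;'.
$good = '<data>' . htmlspecialchars($raw, ENT_XML1 | ENT_QUOTES) . '</data>';

var_dump(parsesCleanly($bad));
var_dump(parsesCleanly($good));
```

Escaping belongs where the XML is written; the parser hands the handler a plain & again when it reads &amp;.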

 

Thanks on behalf of myself and other users of PHP 5. I ask you: why does the PHP 5 parser stop at the first occurrence of '?' and leave all the other occurrences of '?' in the CDATA section untouched (no matter how long the CDATA is)? I ask you why the PHP 4 parser does not have this feature (FOR ME, A BUG) and always returns the whole CDATA in one string $data; I think this PHP 4 behaviour is correct. Tell me why a programmer should have to write code that waits to join a CDATA section into one string. I think the parser must do this, and this is what a parser is made for; its main feature is simply to get the data from the CDATA into one string (not divided between two strings). Thanks for your reply.

 

 

 [2004-11-30 07:22 UTC] chregu@php.net
You had to concatenate character data in PHP 4 as well, if you wanted to write a proper XML parser based on ext/xml.

In PHP 5, it just happens that the "breakup" occurs at different places. Read the user comments at http://www.php.net/manual/en/function.xml-set-character-data-handler.php and the dozens of tutorials available on the net.

Please do not reopen this "bug". It ain't no bug. This behaviour has existed for ages.
 
Last updated: Mon Nov 29 04:03:13 2021 UTC