Bug #30257 Unexpected result of xml_get_current_byte_index and xml_get_current_column_numb
Submitted: 2004-09-27 20:36 UTC Modified: 2005-08-06 17:00 UTC
Votes:7
Avg. Score:4.7 ± 0.5
Reproduced:6 of 6 (100.0%)
Same Version:3 (50.0%)
Same OS:5 (83.3%)
From: christoffer at natlikan dot se Assigned: rrichards (profile)
Status: Not a bug Package: XML related
PHP Version: 5CVS-2005-02-02 OS: *
Private report: No CVE-ID: None
 [2004-09-27 20:36 UTC] christoffer at natlikan dot se
Description:
------------
xml_get_current_byte_index and xml_get_current_column_number return unexpected values when the XML contains an XML declaration. Using php5.0-win32-200409270830 and Apache/1.3.31.


Reproduce code:
---------------
<?php
	function elementOpen($parser, $elementName, $attributes) {
		echo("ElementOpen - Row: " . xml_get_current_line_number($parser) .
			" Col: " . xml_get_current_column_number($parser) .
			" BIndex: " . xml_get_current_byte_index($parser) . "<br />");
	}
	
	function elementClose($parser, $elementName) {
		echo("ElementClose - Row: " . xml_get_current_line_number($parser) .
			" Col: " . xml_get_current_column_number($parser) .
			" BIndex: " . xml_get_current_byte_index($parser) . "<br />");
	}
	
	$parser = xml_parser_create();
	xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
	xml_set_element_handler($parser, "elementOpen", "elementClose");

	$xml = 
		"<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n" .
		"<a b=\"x\">\n" .
			"\ttest\n" .
			"\t<c>\n" .
				"\t\t<d>foo</d>\n" .
			"\t</c>\n" .
		"</a>";
	
	xml_parse($parser, $xml);
	xml_parser_free($parser);
?>

Expected result:
----------------
ElementOpen - Row: 2 Col: 10 BIndex: 52
ElementOpen - Row: 4 Col:  8 BIndex: 63
ElementOpen - Row: 5 Col: 11 BIndex: 69
ElementClose - Row: 5 Col:  9 BIndex: 73
ElementClose - Row: 6 Col:  2 BIndex: 79
ElementClose - Row: 7 Col:  1 BIndex: 85

Actual result:
--------------
ElementOpen - Row: 2 Col:  5 BIndex: 11
ElementOpen - Row: 4 Col:  8 BIndex: 22
ElementOpen - Row: 5 Col: 11 BIndex: 28
ElementClose - Row: 5 Col: 15 BIndex: 36
ElementClose - Row: 6 Col: 18 BIndex: 42
ElementClose - Row: 7 Col: 21 BIndex: 47

 [2004-09-30 01:27 UTC] olivier at samalyse dot com
I'm experiencing similar trouble with xml_get_current_byte_index(), but I don't agree with the expected result christoffer proposes.

Actually, in php4 xml_get_current_byte_index() behaves perfectly for me. Running the reproduce code above with php version 4.3.4 produces:

ElementOpen - Row: 2 Col: 0 BIndex: 44
ElementOpen - Row: 4 Col: 1 BIndex: 61
ElementOpen - Row: 5 Col: 2 BIndex: 67
ElementClose - Row: 5 Col: 8 BIndex: 73
ElementClose - Row: 6 Col: 1 BIndex: 79
ElementClose - Row: 7 Col: 0 BIndex: 84

Byte Index 44 points at the beginning of the <a> tag : 
       <a b="x">
       ^

That is fine.
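
The php4 byte indexes above can be cross-checked without the XML parser. This is a minimal sketch that locates each tag start by plain string matching on the document from the reproduce code. The variable name $tagOffsets and the regular expression are illustrative assumptions, not part of the original report.

```php
<?php
// Same document as in the reproduce code.
$xml =
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n" .
    "<a b=\"x\">\n" .
    "\ttest\n" .
    "\t<c>\n" .
    "\t\t<d>foo</d>\n" .
    "\t</c>\n" .
    "</a>";

// Match the "<" of every start or end tag. The XML declaration is
// skipped because the character after its "<" is not a letter or "/".
preg_match_all('/<\/?[a-z]/', $xml, $matches, PREG_OFFSET_CAPTURE);

// Hypothetical variable: zero-based byte offset of each tag start.
$tagOffsets = array_column($matches[0], 1);

echo implode(", ", $tagOffsets), "\n"; // prints 44, 61, 67, 73, 79, 84
```

These offsets agree with the BIndex column of the php 4.3.4 output, which is consistent with the expat-based parser reporting the start of each tag.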

Now, if you omit the xml declaration '<?xml version="1.0" encoding="ISO-8859-1"?>', using php5, you will obtain :

ElementOpen - Row: 1 Col: 5 BIndex: 8
ElementOpen - Row: 3 Col: 8 BIndex: 19
ElementOpen - Row: 4 Col: 11 BIndex: 25
ElementClose - Row: 4 Col: 15 BIndex: 33
ElementClose - Row: 5 Col: 18 BIndex: 39
ElementClose - Row: 6 Col: 21 BIndex: 44

Byte index 8 does not point at the beginning of the tag anymore, but at its end :
       <a b="x">
               ^

In my particular case (XML indexing/marshalling) that's far less usable. Some may not consider this a bug, but it breaks backward compatibility with php4.
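
For the declaration-free case only, a tag start can be recovered from an end-of-tag index by searching backwards for the nearest "<". The following is a rough sketch under that assumption. tagStartBefore() is a hypothetical helper, not part of the xml extension, and it would be misled by a literal "<" inside a comment or CDATA section.

```php
<?php
// Hypothetical helper: offset of the last "<" at or before $byteIndex,
// or false if there is none.
function tagStartBefore(string $xml, int $byteIndex)
{
    return strrpos(substr($xml, 0, $byteIndex + 1), '<');
}

// The test document without its XML declaration.
$xml = "<a b=\"x\">\n\ttest\n\t<c>\n\t\t<d>foo</d>\n\t</c>\n</a>";

// Byte indexes taken from the php5 output quoted above.
foreach ([8, 19, 25, 33, 39, 44] as $reported) {
    echo $reported, " -> ", tagStartBefore($xml, $reported), "\n";
}
```

This does not help once the XML declaration is present, because the reported indexes then no longer correspond to positions in the document.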

Now, if you leave the xml declaration in place, this function no longer produces anything meaningful. As Christoffer reports, parsing this xml document when it includes '<?xml version="1.0" encoding="ISO-8859-1"?>' will produce :

ElementOpen - Row: 2 Col:  5 BIndex: 11
ElementOpen - Row: 4 Col:  8 BIndex: 22
ElementOpen - Row: 5 Col: 11 BIndex: 28
ElementClose - Row: 5 Col: 15 BIndex: 36
ElementClose - Row: 6 Col: 18 BIndex: 42
ElementClose - Row: 7 Col: 21 BIndex: 47

In this latter case, what seems to happen is that the byte length of the xml declaration is miscalculated. If you add to this the fact that the returned byte index points at the end of the tag (php5) instead of the beginning of the tag (php4), it really starts to look like random output...
 [2005-08-01 00:37 UTC] sniper@php.net
This difference is caused by the fact that in PHP5 we use libxml instead of expat by default. 

To get the expat behaviour, you can always compile PHP with
--with-expat-dir=/path/to/expat 

This change in behaviour needs to be documented though.

 [2005-08-04 11:44 UTC] vrana@php.net
It's so weird that it should be fixed rather than documented:

All functions point to the end of a tag instead of the beginning (this can be documented).

xml_get_current_byte_index() and xml_get_current_column_number() behave unpredictably (this should be fixed):

<?xml version='1.0' encoding='us-ascii'?><a></a> CN=42, BI=4
<?xml version='1.0' encoding='us-ascii' ?><a></a> CN=42, BI=5
<?xml version='1.0'  encoding='us-ascii'?><a></a> CN=42, BI=4
<?xml version='1.0' encoding='utf-8'?><a></a> CN=39, BI=40

Another problem is with attribute values and whitespace. They are not counted in CN (this should be fixed):

<?xml version='1.0' encoding='utf-8'?><a b=''></a> CN=41, BI=45
<?xml version='1.0' encoding='utf-8'?> <a b=''></a> CN=41, BI=46
<?xml version='1.0' encoding='utf-8'?><a  b=''></a> CN=41, BI=46
<?xml version='1.0' encoding='utf-8'?><a b='cde'></a> CN=41, BI=48

Such a weird behavior is nearly undocumentable and unusable for sure.
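
A small harness along these lines can be used to re-run the samples above on a given build. It is a sketch only: it assumes CN and BI are read inside the start-element handler of <a> (the comment does not say where they were read), probePosition() is a hypothetical helper, and the printed values depend on whether the xml extension is built against libxml2 or expat.

```php
<?php
// Hypothetical helper: returns [column, byteIndex] as reported inside the
// start-element handler of the first element, or [null, null] if the
// handler was never called.
function probePosition(string $xml): array
{
    $parser = xml_parser_create();
    $seen = null;
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$seen) {
            if ($seen === null) {
                $seen = [
                    xml_get_current_column_number($p),
                    xml_get_current_byte_index($p),
                ];
            }
        },
        function ($p, $name) {
            // End tags are not inspected in this sketch.
        }
    );
    xml_parse($parser, $xml, true);
    return $seen ?? [null, null];
}

// The first four samples from the comment above.
$samples = [
    "<?xml version='1.0' encoding='us-ascii'?><a></a>",
    "<?xml version='1.0' encoding='us-ascii' ?><a></a>",
    "<?xml version='1.0'  encoding='us-ascii'?><a></a>",
    "<?xml version='1.0' encoding='utf-8'?><a></a>",
];

$results = [];
foreach ($samples as $sample) {
    [$cn, $bi] = probePosition($sample);
    $results[] = [$cn, $bi];
    echo $sample, " CN=", var_export($cn, true), ", BI=", var_export($bi, true), "\n";
}
```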
 [2005-08-06 01:34 UTC] sniper@php.net
Rob, you had some nice explanation about this, iirc?
IIRC, this is libxml issue, not something we can fix?

 [2005-08-06 17:00 UTC] rrichards@php.net
It's a libxml issue, and I wasn't able to find a way to work around any of these. I have a running list of issues that I'll be submitting for documentation.