Bug #30846 non ASCII characters
Submitted: 2004-11-20 01:08 UTC Modified: 2004-12-17 12:08 UTC
From: migmam at ya dot coom Assigned:
Status: Not a bug Package: XML related
PHP Version: 5.0.2 OS: Windows 2000
Private report: No CVE-ID: None
 [2004-11-20 01:08 UTC] migmam at ya dot coom
Description:
------------
Hi,

I have upgraded to PHP 5, and an old module that reads an XML file no longer works.
I use SAX to read the file, and everything works fine until a non-ASCII character is found.
When the parser finds a non-ASCII character (Spanish accented characters in my case), it splits the element's text into two separate pieces. For example, the word "Información" is split into "Informaci" and "ón".
I have declared encoding="ISO-8859-1" in the XML document header.
I have saved the file as ISO-8859-1 encoded text.
In the parser options I have specified ISO-8859-1 (XML_OPTION_TARGET_ENCODING).

And it still doesn't work.

Best regards,

Miguel Angel



 [2004-11-20 01:51 UTC] derick@php.net
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

The default encoding is now UTF-8 (this is a change), but it is expected and documented.
 [2004-11-22 10:08 UTC] migmam at ya dot coom
Hello again.
Thank you very much for your answer.
So if I change the default encoding to "ISO-8859-1" in php.ini (default_charset = "ISO-8859-1"), it should work. But it doesn't.

Note: Sorry for not including all the code earlier. It is the standard SAX parser code copied from the PHP manual.
Here it is:

<?php

$elementoActual = "";
$elementos      = array();
$identificador  = "";
$xml_idioma     = 0;
    
function comienzaElemento($parser, $name, $attr)
{
    global $elementoActual;
    $elementoActual = $name;
}

function finElemento($parser, $name)
{
}

    
function DatoCaracter($parser, $data)
{
    global $elementos;
    global $elementoActual;
    global $identificador;
    global $nodos;
    global $xmlprimernodo;

    if (ord($data) != 10 && ord($data) != 9 && ord($data) != 13) {
        $data = htmlentities($data);
        if ($elementoActual == $xmlprimernodo) {
            $identificador = $data;
        }

        //$nodos = array('descripcion', 'dato', 'id'); // The nodes defined in the XML file
        foreach ($nodos as $elemento_array) {
            //echo "<p>!!->$elemento_array</p>";
            if ($elementoActual == $elemento_array) {
                $elementos[$identificador][$elemento_array] = $data;
                //echo "<p>$elementos[$identificador][$elemento_array]</p>";
            }
        }
    }
}

    
    
function examinaFichero($xmlSource, $xmlNodes, $xmlFirstNode, $idioma)
{
    global $elementos;
    global $nodos;
    global $xmlprimernodo;

    // Clear any previous references
    $elementoActual = null;
    $elementos      = null;
    $identificador  = null;
    //----------------
    $nodos         = $xmlNodes;
    $xmlprimernodo = $xmlFirstNode;
    //------------
    global $xml_idioma;

    $xml_parser = xml_parser_create();

    xml_parser_set_option($xml_parser, XML_OPTION_TARGET_ENCODING, "ISO-8859-1");
    xml_parser_set_option($xml_parser, XML_OPTION_SKIP_WHITE, 0);
    xml_parser_set_option($xml_parser, XML_OPTION_SKIP_TAGSTART, 1);
    xml_set_element_handler($xml_parser, "comienzaElemento", "finElemento");
    xml_set_character_data_handler($xml_parser, "DatoCaracter");
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, FALSE);

    if (!($fp = fopen($xmlSource, "r"))) {
        die("Cannot open $xmlSource.");
    }

    while (($data = fread($fp, filesize(str_replace("\\", "/", $xmlSource))))) {
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            die(sprintf("XML error at line %d column %d file %s",
                xml_get_current_line_number($xml_parser),
                xml_get_current_column_number($xml_parser),
                $xmlSource));
        }
    }

    xml_parser_free($xml_parser);

    return $elementos;
}
 [2004-11-22 10:14 UTC] derick@php.net
No, the default_charset setting has nothing to do with it. It is used only for generating the HTTP headers, as the documentation describes:
http://no.php.net/manual/en/ini.sect.data-handling.php#ini.default-charset
 [2004-11-26 10:56 UTC] migmam at ya dot coom
Hi Derick,

I've read the manual again, and I've tried the examples after changing one character to a non-ASCII character. They don't work.
I have changed all the non-ASCII characters to their corresponding &#xxx; codes, and it only works with SimpleXML (you were right in this case) but not with SAX (using exactly the same XML file). With SAX it still splits the string.

Please, help!

Best regards,

Miguel Angel.
 [2004-12-17 11:20 UTC] migmam at ya dot coom
Hi Derick,

Please take a look at this source code:

//-------------------------------------------------

$file = "prueba.xml";

function startElement($parser, $name, $attrs)
{

}

function endElement($parser, $name)
{

}

function characterData($parser, $data)
{
	
   echo "->".$data."<-";
}

$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser,XML_OPTION_TARGET_ENCODING,"ISO-8859-1");
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
   die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
   if (!xml_parse($xml_parser, $data, feof($fp))) {
       die(sprintf("XML error: %s at line %d",
                   xml_error_string(xml_get_error_code($xml_parser)),
                   xml_get_current_line_number($xml_parser)));
   }
}
xml_parser_free($xml_parser);

// XML FILE prueba.xml-------------------------//
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE elements[
					<!ELEMENT element (data)*>
					<!ELEMENT data (#PCDATA)*>
					
]>
<elements>
	<element>
		 <data>botón cancelar</data>
	</element>
</elements>

//---------------EXPECTED RESULT-------------------
->  <-->  <-->botón cancelar<-->  <--> <-
//---------------ACTUAL OUTPUT---------------------
->  <-->  <-->bot<-->ón cancelar<-->  <--> <-


Best regards,

Miguel Angel.
 [2004-12-17 11:35 UTC] chregu@php.net
There's absolutely nothing wrong with SAX splitting the string. Change your code. It has always been the case that SAX can split the string, in PHP 4 as well. If it didn't happen for you, good for you. But it's clearly stated that SAX *can* split the string if it thinks it has to. Get over it.
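
One common way to cope with split character data is to append every chunk to a buffer and read the buffer only when the element closes. The following is a minimal sketch of that idea, not code from the manual: the inline sample document, the `$buffer` and `$texts` variables, and the choice to keep only non-blank text are all assumptions made for illustration.

```php
<?php
// Sketch: buffer character data per element, because the character data
// handler may be called several times for a single text node.

// Sample document (an assumption for this sketch); \xF3 is "ó" in ISO-8859-1.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n"
     . "<elements><element><data>bot\xF3n cancelar</data></element></elements>";

$buffer = '';       // text collected for the element currently open
$texts  = array();  // hypothetical result list: one entry per non-blank element text

$xml_parser = xml_parser_create("ISO-8859-1");
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler(
    $xml_parser,
    function ($parser, $name, $attrs) use (&$buffer) {
        $buffer = '';               // a new element starts: begin a fresh buffer
    },
    function ($parser, $name) use (&$buffer, &$texts) {
        if (trim($buffer) !== '') {
            $texts[] = $buffer;     // the complete text is only known here
        }
        $buffer = '';
    }
);
xml_set_character_data_handler(
    $xml_parser,
    function ($parser, $data) use (&$buffer) {
        $buffer .= $data;           // append each chunk, however the parser split it
    }
);

if (!xml_parse($xml_parser, $xml, true)) {
    die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($xml_parser)),
                xml_get_current_line_number($xml_parser)));
}

// Show the collected text as hex so the bytes are visible in any terminal.
echo bin2hex($texts[0]), "\n";
```

Resetting the buffer on every start tag discards text that precedes a child element, so this sketch only suits documents like the one above, where text appears in leaf elements.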

For your other problem, use xml_parser_create("ISO-8859-1") and you should be able to parse ISO-8859-1 encoded files.

See http://ch.php.net/manual/en/function.xml-parser-create.php and read also the user comments.
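
To illustrate the effect of the encoding argument, here is a small sketch that parses the same ISO-8859-1 document twice, once with the default parser and once with `xml_parser_create("ISO-8859-1")`, and prints the bytes each handler receives. The `collectText()` helper and the one-element sample document are hypothetical, and the sketch assumes a PHP version (5.0.2 or later) in which the default output encoding is UTF-8, as stated earlier in this report.

```php
<?php
// Sample document (an assumption for this sketch); \xF3 is "ó" in ISO-8859-1.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n<data>bot\xF3n</data>";

// Hypothetical helper: parse $xml and return all character data concatenated.
// When $encoding is null the parser is created with its default encoding.
function collectText($xml, $encoding = null)
{
    $parser = ($encoding === null)
        ? xml_parser_create()
        : xml_parser_create($encoding);

    $text = '';
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$text) {
            $text .= $data;     // concatenate chunks so splitting does not matter
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        die(xml_error_string(xml_get_error_code($parser)));
    }
    return $text;
}

$utf8   = collectText($xml);                // handler receives UTF-8 bytes
$latin1 = collectText($xml, "ISO-8859-1");  // handler receives ISO-8859-1 bytes

// Print both results as hex so the difference in the "ó" bytes is visible.
echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Comparing the two hex strings shows whether the accented character arrives as one byte or two, which tells you which encoding later code such as htmlentities() needs to expect.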

Please do not reopen this bug. It isn't a bug.
 [2004-12-17 12:08 UTC] migmam at ya dot coom
It never happened to me in PHP 4. Same code, same XML file. The only change was the PHP version, and it now splits the string exactly before non-ASCII characters.
I can work around it by rewriting my code, but I don't think this is good behaviour.

Anyway, thanks for your time.
 
PHP Copyright © 2001-2024 The PHP Group
All rights reserved.
Last updated: Sun Dec 22 11:01:30 2024 UTC