Request #55127 SimpleXML and HTML5 microformat
Submitted: 2011-07-04 07:03 UTC Modified: 2011-07-04 08:09 UTC
Votes:2
Avg. Score:5.0 ± 0.0
Reproduced:1 of 1 (100.0%)
Same Version:0 (0.0%)
Same OS:0 (0.0%)
From: frederic dot auguste at gmail dot com Assigned:
Status: Wont fix Package: SimpleXML related
PHP Version: 5.3.6 OS:
Private report: No CVE-ID: None
 [2011-07-04 07:03 UTC] frederic dot auguste at gmail dot com
Description:
------------
We would like to manipulate and generate HTML5 microformat markup.

Parsing HTML5 microformat markup with SimpleXML is not possible: warnings are 
generated and simplexml_load_string() returns false.

The problem is the itemscope attribute: it has no value.

Our markup is taken from this web site: http://schema.org/Person

Could you add these capabilities to the SimpleXML API?
 * adding an attribute without a value
 * parsing XML that contains an attribute without a value

Thanks.

Test script:
---------------
<?php
$xml = <<<XML
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" />

  <span itemprop="jobTitle">Professor</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">
      20341 Whitworth Institute
      405 N. Whitworth
    </span>
    <span itemprop="addressLocality">Seattle</span>,
    <span itemprop="addressRegion">WA</span>
    <span itemprop="postalCode">98052</span>
  </div>
  <span itemprop="telephone">(425) 123-4567</span>
  <a href="mailto:jane-doe@xyz.edu" itemprop="email">
    jane-doe@xyz.edu</a>

  Jane's home page:
  <a href="www.janedoe.com" itemprop="url">janedoe.com</a>

  Graduate students:
  <a href="www.xyz.edu/students/alicejones.html" itemprop="colleagues">
    Alice Jones</a>
  <a href="www.xyz.edu/students/bobsmith.html" itemprop="colleagues">
    Bob Smith</a>
</div>
XML;

$a = simplexml_load_string($xml);

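// simplexml_load_string() returns false on a parse failure. Compare strictly,
// because an empty SimpleXMLElement also evaluates as false with ==.
if ($a === false) {
	echo "XML not valid";
} else {
	echo $a->asXML();
}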



 [2011-07-04 08:04 UTC] aharvey@php.net
-Status: Open +Status: Wont fix
 [2011-07-04 08:04 UTC] aharvey@php.net
By definition, it's not valid XML. It's already possible to use SimpleXML to 
manipulate this markup by using DOMDocument::loadHTML() first, e.g.:

$doc = new DOMDocument;
$doc->loadHTML($xml);
$a = simplexml_import_dom($doc->documentElement);

I don't really see any point complicating the SimpleXML API to support this, 
given the workaround is that easy.
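
A slightly fuller sketch of this workaround, assuming a shortened copy of the markup from the test script. The variable names ($html, $people, $person) are illustrative only. loadHTML() wraps a fragment in <html> and <body>, so the microdata element is located with XPath rather than read from the document root:

```php
<?php
// Shortened copy of the reporter's markup; itemscope has no value.
$html = <<<HTML
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Professor</span>
</div>
HTML;

// Collect libxml diagnostics internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);

libxml_clear_errors();
libxml_use_internal_errors($previous);

// documentElement is the implied <html> element, not the <div>.
$root = simplexml_import_dom($doc->documentElement);

// Select every <div> that carries an itemscope attribute.
$people = $root->xpath('//div[@itemscope]');
$person = $people[0];

$itemtype = (string) $person['itemtype'];
$name     = (string) $person->span[0];

echo $itemtype, "\n", $name, "\n";
```

The XPath test [@itemscope] only checks that the attribute is present, which is all a valueless attribute can express.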
 [2011-07-04 08:09 UTC] pajoye@php.net
Additionally, you can use tidy to create reasonably valid XHTML out of broken 
HTML input.
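
A minimal sketch of the tidy route, assuming the tidy extension is installed. The configuration options shown are a guess at a reasonable setup, not a recommendation from this report. Whether the valueless attribute comes out in a form SimpleXML accepts may depend on the tidy version, so the result is checked rather than assumed:

```php
<?php
$html = '<div itemscope itemtype="http://schema.org/Person">'
      . '<span itemprop="name">Jane Doe</span></div>';

$element = false;

if (extension_loaded('tidy')) {
    // Ask tidy for XHTML output, limited to the body contents.
    $xhtml = tidy_repair_string($html, array(
        'output-xhtml'     => true,
        'show-body-only'   => true,
        'numeric-entities' => true,
    ), 'utf8');

    if (is_string($xhtml) && $xhtml !== '') {
        // Suppress warnings so a parse failure only shows up as false.
        $previous = libxml_use_internal_errors(true);
        $element = simplexml_load_string($xhtml);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
    }
}

if ($element === false) {
    echo "tidy unavailable or output not parseable as XML\n";
} else {
    echo $element->asXML(), "\n";
}
```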
 [2012-09-14 18:27 UTC] blasterdrp at gmail dot com
"I don't really see any point complicating the SimpleXML API to support this, 
given the workaround is that easy."

The DOMDocument class is disrespectful toward input HTML and makes a lot of assumptions that you can't dissuade it from making. It self-terminates tags you may not want to self-terminate, such as <script/>, or leaves open tags you may want to close, such as <li>, depending on whether you put it in quirks mode or strict mode. Furthermore, it arbitrarily adds a DTD if it doesn't like the one you already have (<!DOCTYPE html> is acceptable for HTML5), and does the same with meta tags, and so forth.

The DOMDocument should only change what you tell it to change, but instead it changes everything, and there's no way to tell it not to.
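
For the added DOCTYPE and wrapper elements specifically, later PHP versions (5.4 and up, with a sufficiently recent libxml2) accept libxml option flags in loadHTML(). The sketch below assumes such a version, so it does not apply to the 5.3.6 this report was filed against, and it does not address the other complaints about tag serialization:

```php
<?php
$html = '<div itemscope itemtype="http://schema.org/Person">'
      . '<span itemprop="name">Jane Doe</span></div>';

$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
// LIBXML_HTML_NOIMPLIED: do not add implied <html>/<body> elements.
// LIBXML_HTML_NODEFDTD: do not add a default DOCTYPE when none is given.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

libxml_clear_errors();
libxml_use_internal_errors($previous);

$rootName   = $doc->documentElement->nodeName;
$serialized = trim($doc->saveHTML());

echo $rootName, "\n", $serialized, "\n";
```

LIBXML_HTML_NOIMPLIED tends to behave best when the input has a single top-level element, as here. With several sibling elements at the top level the resulting tree can be surprising, so wrapping the fragment in one container first is a common precaution.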
 