Simplify PHP DOM XML parsing - how?


Question

I've spent whole days with PHP's DOM functions, but I still can't understand how they work. :( I have a simple XML file that looks okay, but I cannot use it the way I intended when I created its structure.

Sample XML fragment:

-pages //root element
    -page id="1" //we can have any number of pages
        -product id="364826" //we can have any number of products
            -SOME_KIND_OF_VALUE
            -ANOTHER_VALUE
            ...

My original idea was to speed up my client's workflow, so I threw out the old CSVs and started using XML.

Problem 1: When I group products into pages, I use setIdAttribute to prevent storing the same page in the tree more than once. This works fine until reading happens, because these ids are tied to some kind of DTD (as far as getElementById is concerned).

Question 1: How can I write a simple DTD that provides the necessary information, so I can use getElementById in the reading phase too?

Problem 2: Because I have pages, I'd like to load as little information as I can. That is why I created the id attribute on pages. Now I cannot access my page id="2" directly because of Problem 1 above (getElementById is useless currently). Somehow I managed to retrieve the necessary information about each product on a given page, but my code looks scary:

$doc      = DOMDocument::load('data.xml');
$xpath    = new DOMXPath($doc);
$query    = '/pages/page[' . $page . ']'; //$page is fine: was set earlier
$products = $xpath->query($query);
$_prods   = $doc->getElementsByTagName('product');
foreach($_prods as $product){
    foreach($product->childNodes as $node){
        echo $node->nodeName . ": " . $node->nodeValue . "<br />";
    }
}

Question 2: I think the code above is an example of how not to parse XML. But because of my limited knowledge of PHP's DOM functions, I cannot write a cleaner version myself. I tried some trivial solutions, but none of them worked for me.

Please help me if you can.

Thanks, fabrik

Solution

Solving Problem 1:

The W3C xml:id specification defines "the meaning of the attribute xml:id as an ID attribute in XML documents and defines processing of this attribute to identify IDs in the absence of validation, without fetching external resources, and without relying on an internal subset."

In other words, when you use

$element->setAttribute('xml:id', 'test');

you do not need to call setIdAttribute, nor specify a DTD or Schema. DOM will recognize the xml:id attribute when used with getElementById, without you having to validate the document or anything else. This is the least-effort approach. Note, though, that depending on your OS and version of libxml, getElementById may not work at all.
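Since Question 1 asked for a DTD specifically, here is a minimal sketch of that route as well. It assumes the DTD lives in an internal subset of the document and that validation is switched on via validateOnParse before loading; the sample document and element names are hypothetical. Note that values of type ID must be valid XML names, so an id such as "1" would fail validation, while "p1" is fine.

```php
<?php
// Hypothetical sample document: the internal DTD subset declares
// the id attribute of <page> as type ID.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE pages [
  <!ELEMENT pages (page*)>
  <!ELEMENT page (product*)>
  <!ATTLIST page id ID #REQUIRED>
  <!ELEMENT product (#PCDATA)>
]>
<pages>
  <page id="p1"><product>foo1</product></page>
  <page id="p2"><product>bar1</product></page>
</pages>
XML;

$dom = new DOMDocument;
$dom->validateOnParse = true; // validate against the DTD while loading
$dom->loadXML($xml);

// With the DTD applied, the plain id attribute is a real ID.
$page = $dom->getElementById('p2');
echo $page === null ? "not found\n" : $page->getAttribute('id') . "\n"; // p2
```

The trade-off compared with xml:id is that every document you read has to carry (or reference) the DTD and has to be valid against it.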

Solving Problem 2:

Even with IDs not being fetchable with getElementById, you can still very much fetch them with XPath:

$xpath->query('/pages/page[@id=1]');

would definitely work. And you can also fetch the product children for a specific page directly:

$xpath->query('//pages/page[@id=1]/products');

Apart from this, there is very little you can do to make DOM code look less verbose, because it really is a verbose interface. It has to be, because DOM is a language agnostic interface, again defined by the W3C.


EDIT after comment below

It is working like I explained above. Here is a full test case for you. The first part is for writing new XML files with DOM. That is where you need to set the xml:id attribute. You use this instead of the regular, non-namespaced, id attribute.

// Setup
$dom = new DOMDocument;
$dom->formatOutput = TRUE;
$dom->preserveWhiteSpace = FALSE;
$dom->loadXML('<pages/>');

// How to set a valid id attribute when not using a DTD or Schema
$page1 = $dom->createElement('page');
$page1->setAttribute('xml:id', 'p1');
$page1->appendChild($dom->createElement('product', 'foo1'));
$page1->appendChild($dom->createElement('product', 'foo2'));

// How to set an ID attribute that requires a DTD or Schema when reloaded
$page2 = $dom->createElement('page');
$page2->setAttribute('id', 'p2');
$page2->setIdAttribute('id', TRUE);
$page2->appendChild($dom->createElement('product', 'bar1'));
$page2->appendChild($dom->createElement('product', 'bar2'));

// Appending pages and saving XML
$dom->documentElement->appendChild($page1);
$dom->documentElement->appendChild($page2);
$xml = $dom->saveXML();
unset($dom, $page1, $page2);
echo $xml;

This will create an XML file like this:

<?xml version="1.0"?>
<pages>
  <page xml:id="p1">
    <product>foo1</product>
    <product>foo2</product>
  </page>
  <page id="p2">
    <product>bar1</product>
    <product>bar2</product>
  </page>
</pages>

When you read the XML in again, the new DOM instance no longer knows that you declared the non-namespaced id attribute as an ID attribute with setIdAttribute. It will still be in the XML, but the id attribute will just be a regular attribute. You have to be aware that ID attributes are special in XML.

// Load the XML we created above
$dom = new DOMDocument;
$dom->loadXML($xml);

Now for some tests:

echo "\n\n GETELEMENTBYID RETURNS ELEMENT WITH XML:ID \n\n";
foreach( $dom->getElementById('p1')->childNodes as $product) {
    echo $product->nodeValue; // Will output foo1 and foo2 with whitespace
}

The above works because a DOM-compliant parser has to recognize that xml:id is an ID attribute, regardless of any DTD or Schema. This is explained in the xml:id specification mentioned above. The reason it outputs whitespace is that, due to the formatted output, there are DOMText nodes between the opening tag, the two product tags and the closing tag, so we are iterating over five nodes. The node concept is crucial to understand when working with XML.
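If the whitespace nodes get in the way, one option is to skip everything that is not an element while iterating. The following is a small sketch under that assumption; the sample XML is a hypothetical stand-in for the formatted output shown earlier, and the page is fetched by tag name so the example does not depend on getElementById.

```php
<?php
// Hypothetical formatted input, similar to the generated file above.
$xml = <<<'XML'
<pages>
  <page xml:id="p1">
    <product>foo1</product>
    <product>foo2</product>
  </page>
</pages>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$page = $dom->getElementsByTagName('page')->item(0);

// Three whitespace text nodes plus two product elements.
echo $page->childNodes->length, "\n"; // 5

$values = [];
foreach ($page->childNodes as $node) {
    // Keep element nodes only; this skips the whitespace DOMText nodes.
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $values[] = $node->nodeValue;
    }
}
echo implode(',', $values), "\n"; // foo1,foo2
```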

echo "\n\n GETELEMENTBYID CANNOT FETCH NORMAL ID \n\n";
foreach( $dom->getElementById('p2')->childNodes as $product) {
    echo $product->nodeValue; // Will output a NOTICE and a WARNING
}

The above will not work, because id is not an ID attribute. For the DOM parser to recognize it as such, you need a DTD or Schema and the XML must be validated against it.
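A possible workaround that needs neither a DTD nor xml:id is to re-declare the attribute as an ID after loading, by calling setIdAttribute on each element again. This is only a sketch on a separate, hypothetical document (it is not part of the test case above), and it costs one pass over all page elements:

```php
<?php
// Hypothetical document with plain, non-namespaced id attributes.
$dom = new DOMDocument;
$dom->loadXML(
    '<pages><page id="p1"/><page id="p2"><product>bar1</product></page></pages>'
);

// Right after loading, id is just a regular attribute.
$before = $dom->getElementById('p2'); // null

// Re-register the attribute as an ID on every page element.
foreach ($dom->getElementsByTagName('page') as $page) {
    if ($page->hasAttribute('id')) {
        $page->setIdAttribute('id', true);
    }
}

$after = $dom->getElementById('p2');
echo $after->getAttribute('id'), "\n"; // p2
```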

echo "\n\n XPATH CAN FETCH NORMAL ID \n\n";
$xPath = new DOMXPath($dom);
$page2 = $xPath->query('/pages/page[@id="p2"]')->item(0);
foreach( $page2->childNodes as $product) {
    echo $product->nodeValue; // Will output bar1 and bar2
}

XPath on the other hand is literal about the attributes, which means you can query the DOM for the page element with attribute id if getElementById is not available. Note that to query the page with ID p1, you'd have to include the namespace, e.g. @xml:id="p1".
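As a short illustration of the namespaced variant, the sketch below queries a page by its xml:id. The inline document is hypothetical; the xml prefix is reserved in XML, so no namespace registration is assumed to be necessary here.

```php
<?php
// Hypothetical document mixing an xml:id page and a plain id page.
$dom = new DOMDocument;
$dom->loadXML(
    '<pages><page xml:id="p1"><product>foo1</product></page><page id="p2"/></pages>'
);

$xpath = new DOMXPath($dom);

// Match on the namespaced attribute, prefix included.
$matches = $xpath->query('/pages/page[@xml:id="p1"]');
echo $matches->length, "\n"; // 1
echo $matches->item(0)->firstChild->nodeValue, "\n"; // foo1
```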

echo "\n\n XPATH CAN FETCH PRODUCTS FOR PAGE WITH ID \n\n";
$xPath = new DOMXPath($dom);
foreach( $xPath->query('/pages/page[@id="p2"]/product') as $product ) {
    echo $product->nodeValue; // Will output bar1 and bar2 without whitespace
}

As mentioned, you can also use XPath to query anything else in the document. This one will not output whitespace, because it only returns the product elements below the page with id p2.
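To keep the calling code short, a query like this can be wrapped in a small function. The helper below is hypothetical (its name and the id whitelist are assumptions, not part of the DOM API); it restricts the id to simple characters so that it can be placed inside the XPath string safely.

```php
<?php
/**
 * Hypothetical helper: returns the text of all <product> children
 * of the <page> whose plain id attribute equals $pageId.
 */
function productValuesForPage(DOMDocument $dom, string $pageId): array
{
    // Only allow simple ids, since the value is concatenated into the query.
    if (!preg_match('/^[A-Za-z0-9_-]+$/', $pageId)) {
        throw new InvalidArgumentException('Unexpected page id: ' . $pageId);
    }

    $xpath  = new DOMXPath($dom);
    $values = [];
    foreach ($xpath->query('/pages/page[@id="' . $pageId . '"]/product') as $product) {
        $values[] = $product->nodeValue;
    }
    return $values;
}

// Usage with a hypothetical document.
$dom = new DOMDocument;
$dom->loadXML(
    '<pages><page id="p1"><product>foo1</product></page>'
    . '<page id="p2"><product>bar1</product><product>bar2</product></page></pages>'
);
echo implode(',', productValuesForPage($dom, 'p2')), "\n"; // bar1,bar2
```

An unknown id simply yields an empty array, so the caller does not have to deal with null checks.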

You can also traverse the entire DOM from a node. It's a tree structure. Since DOMNode is the most important class in DOM, you want to familiarize yourself with it.

echo "\n\n TRAVERSING UP AND DOWN \n\n";
$product = $dom->getElementsByTagName('product')->item(2);
echo $product->tagName; // 'product'
echo $dom->saveXML($product); // '<product>bar1</product>'

// Going from bar1 to foo1
$product = $product->parentNode // Page Node
                   ->parentNode // Pages Node
                   ->childNodes->item(1)  // Page p1
                   ->childNodes->item(1); // 1st Product

echo $product->nodeValue; // 'foo1'

// from foo1 to foo2 it is two(!) nodes because the XML is formatted
echo $product->nextSibling->nodeName; // '#text' with whitespace and linebreak
echo $product->nextSibling->nextSibling->nodeName; // 'product'
echo $product->nextSibling->nextSibling->nodeValue; // 'foo2'

On a side note: yes, I do have a typo in the original code above. It's product, not products. But I find it hardly justified to claim the code does not work when all you have to change is an s. That just feels too much like wanting to be spoonfed.
