DOMDocument for parsing HTML (instead of regex)
Question
I am trying to learn how to use DOMDocument to parse HTML.
I am only doing some simple work. I liked Gordon's answer on scraping data using regex and simplehtmldom, and I based my code on his work.
I found the documentation on PHP.net lacking: it has limited information and almost no examples, and most of the specifics are about parsing XML.
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html');
libxml_clear_errors();
$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case.
# title
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv));
# content (this is not working)
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv));
?>
For testing purposes I am trying to get the title (between the h1 tags) and the content (as HTML) of a nu.nl news article.
As you can see, I can get the title, although I am not happy with that evaluate string, since it only works because the h1 happens to be the only h1 tag at that div level.
Recommended answer
Here is how you could do it with DOM and XPath:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/…');
libxml_clear_errors();
$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $dom->saveHTML(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0)
);
The XPath string(id("leadarticle")/div/h1)
returns the textContent of the h1 that is a child of a div that is itself a child of the element with the id leadarticle.
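To see the string() query in isolation, here is a minimal sketch that runs it against an inline HTML string instead of the live page. The markup, the class name header and the title text are made up for illustration; the real nu.nl page structure may differ.

```php
<?php
// Hypothetical markup standing in for the real page (structure is an assumption).
$html = <<<'HTML'
<html><body>
<div id="leadarticle">
  <div class="header"><h1>Example title</h1></div>
  <div class="content"><p>First paragraph.</p></div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// id("leadarticle") selects the element with that id; /div/h1 walks two child steps.
// string(...) converts the first matching node to its text content.
$title = $xpath->evaluate('string(id("leadarticle")/div/h1)');

echo $title, PHP_EOL; // Example title
```

Because string() yields an empty string when nothing matches, checking for '' is one simple way to detect a path that no longer fits the page.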
The XPath id("leadarticle")/div[@class="content"]
returns the div with the class attribute content that is a child of the element with the id leadarticle.
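One caveat worth keeping in mind: [@class="content"] compares the whole attribute value, so a div with class="content wide" would not match. The sketch below contrasts the exact comparison with a commonly used XPath 1.0 idiom for matching a single class token. The markup and the extra class name wide are hypothetical.

```php
<?php
// Hypothetical markup: the second div carries an additional class name.
$html = '<div id="leadarticle">'
      . '<div class="content">A</div>'
      . '<div class="content wide">B</div>'
      . '</div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison of the whole class attribute value.
$exact = $xpath->evaluate('id("leadarticle")/div[@class="content"]');

// Token match: pad the normalized value with spaces so that "content"
// is only matched as a complete, space-separated class name.
$token = $xpath->evaluate(
    'id("leadarticle")/div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
);

echo $exact->length, PHP_EOL; // 1
echo $token->length, PHP_EOL; // 2
```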
Because you want the outerHTML of the content div, you have to fetch the entire node and not just its text content, hence no string() function in the XPath. Passing a node to the DOMDocument::saveHTML()
method (which is only possible as of PHP 5.3.6) will then serialize that node back to HTML.
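As a rough illustration of that last step, the sketch below serializes the matched node and, as an optional extra, builds the inner HTML by serializing each child node. The inline markup is hypothetical, and the null check is an added precaution for the case where the query matches nothing.

```php
<?php
// Hypothetical markup standing in for the article page.
$html = '<div id="leadarticle"><div class="content"><p>Body text</p></div></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// item(0) returns null when the node list is empty.
$node = $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0);

$outer = '';
$inner = '';
if ($node !== null) {
    // Outer HTML: the node itself plus everything inside it.
    $outer = $dom->saveHTML($node);

    // Inner HTML: serialize each child and concatenate the results.
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $outer, PHP_EOL;
echo $inner, PHP_EOL;
```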