DOMDocument for parsing HTML (instead of regex)

Question

I am trying to learn how to use DOMDocument to parse HTML.

I am only doing some simple work. I liked Gordon's answer on scraping data using regex and simplehtmldom, and I based my code on his work.

I found the documentation on PHP.net not that good: the information is limited, there are almost no examples, and most of the specifics are about parsing XML.

<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case.

# title
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv));

# content (this is not working)
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv));
?>

For testing purposes, I am trying to get the title (the text between the h1 tags) and the content (as HTML) of a nu.nl news article.

As you can see, I can get the title, although I am not happy with that evaluate string, since it only works because the h1 happens to be the only h1 tag at that div level.

Recommended answer

Here is how you could do it with DOM and XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/…');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $dom->saveHtml(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0)
);

The XPath string(id("leadarticle")/div/h1) returns the textContent of the h1 that is a child of a div, which in turn is a child of the element with the id leadarticle.
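As a rough illustration of that first expression, here is a minimal sketch that runs it against an inline document instead of the live page. The markup in $html is made up for the example and only mimics the structure the XPath expects; the real nu.nl page may look different.

```php
<?php
// Hypothetical markup standing in for the real page (an assumption, not nu.nl's HTML).
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
<div id="leadarticle">
  <div class="header"><h1>Example headline</h1></div>
  <div class="content"><p>First paragraph.</p></div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string(...) converts the first matching node to its text content.
$title = $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $title, PHP_EOL;

// When nothing matches, string(...) yields an empty string rather than an error.
$missing = $xpath->evaluate('string(id("leadarticle")/div/h2)');
var_dump($missing === '');
```

Because string() silently returns an empty string when the path matches nothing, it may be worth checking for an empty result before using the title.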

The XPath id("leadarticle")/div[@class="content"] returns the div whose class attribute is content and that is a child of the element with the id leadarticle.
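One caveat with [@class="content"] is that it compares the whole attribute value, so a div with class="content wide" would not match. The sketch below contrasts the exact comparison with a commonly used token-based test. The markup and the extra class name wide are invented for the example.

```php
<?php
// Hypothetical markup; the second div carries more than one class name.
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
<div id="leadarticle">
  <div class="content"><p>Exact match.</p></div>
  <div class="content wide"><p>Multi-class match.</p></div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: only the div with class="content" matches.
$exact = $xpath->query('id("leadarticle")/div[@class="content"]');

// Token comparison: pad the normalized class list with spaces,
// then look for " content " as a whole word.
$token = $xpath->query(
    'id("leadarticle")/div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
);

echo $exact->length, PHP_EOL;
echo $token->length, PHP_EOL;
```

The token form is the closest XPath 1.0 gets to a "get element by class" lookup, which is what the question was hoping for.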

Because you want the outerHTML of the content div, you have to fetch the entire node and not just its text content, hence there is no string() function in the XPath. Passing a node to the DOMDocument::saveHTML() method (which is only possible as of PHP 5.3.6) then serializes that node back to HTML.
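To make the difference between text content and outerHTML concrete, here is a small sketch under the same assumptions as before: the markup is invented and a PHP version that accepts a node argument in saveHTML() is assumed.

```php
<?php
// Hypothetical markup (an assumption for illustration only).
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
<div id="leadarticle">
  <div class="content"><p>Some <b>bold</b> text.</p></div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Without string(), evaluate() returns a DOMNodeList for a node-set expression.
$nodes = $xpath->evaluate('id("leadarticle")/div[@class="content"]');
$content = $nodes->item(0);   // null if nothing matched

if ($content !== null) {
    // textContent drops all markup and keeps only the text.
    $text = $content->textContent;

    // saveHTML($node) serializes the node itself, tags included.
    $outer = $dom->saveHTML($content);

    echo $text, PHP_EOL;
    echo $outer, PHP_EOL;
}
```

Guarding against a null result from item(0) avoids passing nothing to saveHTML(), which would serialize the whole document instead of a single node.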
