DOMDocument for parsing HTML (instead of regex)

Views: 104 | Published: 2017/6/24 22:24:11 | Tags: php, parsing, dom, xpath


Question

I am trying to learn how to use DOMDocument for parsing HTML code.

I am just doing some simple work. I liked Gordon's answer on scraping data with regex and simplehtmldom, and based my code on his work.

I found the documentation on PHP.net lacking: it has limited information and almost no examples, and most of the specifics concern parsing XML.

<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/internet/1106541/taalunie-keurt-open-sourcewoordenlijst-goed.html');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('page'); // would have preferred getContentbyClass('content') (unique) in this case.

# title
print_r($xpath->evaluate('string(div/div/div/div/div/h1)', $contentDiv));

# content (this is not working)
#print_r($xpath->evaluate('string(div/div/div/div['content'])', $contentDiv)); // if only this worked
print_r($xpath->evaluate('string(div/div/div/div)', $contentDiv));
?>

For testing purposes, I am trying to get the title (between the h1 tags) and the content (as HTML) of a nu.nl news article.

As you can see, I can get the title, although I am not that happy with the evaluate string, since it only works because that h1 happens to be the only h1 tag at that div level.

Recommended answer

Here is how you could do it with DOM and XPath:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.nu.nl/…');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
echo $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $dom->saveHtml(
    $xpath->evaluate('id("leadarticle")/div[@class="content"]')->item(0)
);

The XPath string(id("leadarticle")/div/h1) returns the textContent of the h1 that is a child of a div that is itself a child of the element with the id leadarticle.
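As a minimal offline sketch of that first query, the snippet below runs it against a small inline document instead of the live page. The markup is hypothetical: only the id leadarticle and the div/h1 nesting are taken from the answer, and the real nu.nl structure may differ.

```php
<?php
// Hypothetical markup standing in for the article page; only the id
// "leadarticle" and the div/h1 nesting mirror the XPath in the answer.
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
<div id="leadarticle">
  <div class="header"><h1>Example title</h1></div>
  <div class="content"><p>Example <b>body</b> text.</p></div>
</div>
</body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string(...) converts the first matching node to its text content,
// so evaluate() returns a plain PHP string rather than a node list.
$title = $xpath->evaluate('string(id("leadarticle")/div/h1)');
echo $title, PHP_EOL;
```

If no node matches, string() yields an empty string, so an empty result is worth checking for before using the title.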

The XPath id("leadarticle")/div[@class="content"] returns the div with the class attribute content that is a child of the element with the id leadarticle.
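One caveat worth illustrating: [@class="content"] compares the whole attribute value, so it would not match an element carrying several classes. The sketch below, on hypothetical markup, contrasts the exact comparison with a commonly used token-matching idiom. The idiom is not part of the answer above and is only needed if the target div has more than one class.

```php
<?php
// Hypothetical markup: one div has two classes, the other exactly one.
$html = <<<'HTML'
<div id="leadarticle">
  <div class="content wide"><p>First</p></div>
  <div class="content"><p>Second</p></div>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: matches class="content" but not class="content wide".
$exact = $xpath->query('id("leadarticle")/div[@class="content"]');

// Token match: pad the class list with spaces and search for " content ",
// which finds the class regardless of its position in the attribute.
$token = $xpath->query(
    'id("leadarticle")/div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
);

echo $exact->length, ' exact, ', $token->length, ' by token', PHP_EOL;
```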

Because you want the outerHTML of the content div, you have to fetch the entire node and not just its text content, hence no string() function in the XPath. Passing a node to the DOMDocument::saveHTML() method (which is only possible as of PHP 5.3.6) then serializes that node back to HTML.
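To make the difference between the two results concrete, here is a small sketch on hypothetical inline markup that serializes the matched node with saveHTML() and also reads its textContent. The check on the node list length is an added precaution, not something the answer includes.

```php
<?php
// Hypothetical markup; only the id and class names follow the answer.
$html = '<div id="leadarticle"><div class="content">'
      . '<p>Example <b>body</b> text.</p></div></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('id("leadarticle")/div[@class="content"]');

$outer = '';
$text  = '';
if ($nodes->length > 0) {               // item(0) is null when nothing matched
    $content = $nodes->item(0);
    $outer = $dom->saveHTML($content);  // markup of the node, including the div itself
    $text  = $content->textContent;     // text only, all tags dropped
}

echo $outer, PHP_EOL;
echo $text, PHP_EOL;
```

The first line printed keeps the div and its inline tags, while the second contains only the text, which is what string() would have returned.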
