Parsing HTML file in R

Problem description

I want to read HTML files from a web site. Specifically, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with the tag "h2", and the content of each chapter follows in paragraph tags "p" after the "h2". Using the XML package, I am able to get either the text value or the full HTML code of each tag.

Here is some sample code using George Eliot's Middlemarch:

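library(XML)

# Parse the page into an internal document so that XPath queries can be run on it
doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternalNodes = TRUE)
# Text content of every h2 and p node
doc.value <- xpathApply(doc.html, '//h2|//p', xmlValue)
# The nodes themselves, including their attributes and markup
doc.html.value <- xpathApply(doc.html, '//h2|//p')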

doc.value contains a list where each element is the text content of a tag, but I cannot tell whether it came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list with the HTML code for each tag. This tells me whether it is an "h2" or a "p" tag, but it also contains a lot of extra code (style information, etc.) that I don't need.

My question: is there a simple way to obtain only the type of the tag and the value of the tag, without the other information associated with it?

Recommended answer

The documentation for xmlValue points to a related function, xmlName, which extracts just the name of the tag. Combining the two gives what you want:

doc.html.name.value <- xpathApply(doc.html, '//h2|//p', function(x) {
  list(name = xmlName(x), content = xmlValue(x))
})

> doc.html.name.value[[1]]
$name
[1] "h2"

$content
[1] "\r\nGeorge Eliot\r\n"
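
As a follow-up, the name/value pairs can be collected into a data frame and the paragraphs grouped under the h2 heading that precedes them. The following is only a minimal sketch under stated assumptions: the HTML string is a hypothetical stand-in for a Gutenberg page (so the code runs without network access), the variable names are illustrative, and it assumes the nodes come back in document order and that every chapter starts with an h2. Paragraphs that appear before the first h2 are dropped.

```r
library(XML)

# Hypothetical stand-in for a Gutenberg page, so the sketch runs offline.
html <- paste0(
  "<html><body>",
  "<h2 style='text-align:center'>\r\nChapter I\r\n</h2>",
  "<p class='noindent'>First paragraph.</p>",
  "<p>Second paragraph.</p>",
  "<h2>Chapter II</h2>",
  "<p>Third paragraph.</p>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)
nodes <- getNodeSet(doc, "//h2|//p")

# One row per node: tag name and trimmed text content, attributes left out.
tags <- data.frame(
  name    = sapply(nodes, xmlName),
  content = trimws(sapply(nodes, xmlValue)),
  stringsAsFactors = FALSE
)

# Each h2 starts a new chapter; cumsum() numbers the chapters in order.
tags$chapter <- cumsum(tags$name == "h2")

# Group the paragraphs by chapter number and label each group with its title.
chapter_titles <- tags$content[tags$name == "h2"]
paragraphs <- tags[tags$name == "p", ]
chapters <- split(
  paragraphs$content,
  factor(paragraphs$chapter, levels = seq_along(chapter_titles))
)
names(chapters) <- chapter_titles
```

Here tags keeps the flat name/content view asked for in the question, while chapters is a named list with one character vector of paragraphs per chapter title. The trimws call removes the leading and trailing line breaks seen in the output above.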
