R: Read & Parse HTML to List

Problem description

I have been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.

Here is an excerpt of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition">

    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
....
</select>

There is only one <select...> tag; the rest are all <option value=x> tags.

I've been using the XML library. I can remove the newlines and tabs, but haven't had any success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n")
conditions.text <- gsub('[\t\n]',"",conditions.html)

As a final result, I'd like a list of all of the conditions that I can process further for later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU
...

I'm not sure if I need to use the XML library (or another library) or if gsub patterns would be sufficient (either way, I need to work out how to use it).

Recommended answer

Here is a start using the rvest package:

library(rvest)

# read the HTML page
page <- read_html("test.html")

# get the text from the "option" nodes and trim the surrounding whitespace
nodes <- trimws(html_text(html_nodes(page, "option")))

# additional clean-up: remove each newline together with the
# whitespace around it, regardless of how wide the indentation is
nodes <- gsub("\\s*\n\\s*", "", nodes)

The vector nodes should be the result you requested. This example is based on the limited sample provided above, so the actual page may produce unexpected results.
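To try the approach without a file on disk, here is a minimal sketch that parses a copy of the excerpt from the question as an inline string. The second option's value ("102") and the names sample_html, conditions and condition_factor are assumptions made for illustration only; the real page may need further clean-up.

```r
library(rvest)

# Sample markup modelled on the excerpt in the question.
# The second option's value ("102") is made up for illustration.
sample_html <- '<select multiple="true" name="asilomarCondition" id="asilomarCondition">
    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
    <option value="102">
        Behavior- Aggression, Confrontational-Toward People (moderate/severe)
        -
        UU</option>
</select>'

# read_html() accepts literal markup as well as a path or URL.
page <- read_html(sample_html)

# Extract the text of every <option> node.
conditions <- html_text(html_nodes(page, "option"))

# Drop each line break together with the whitespace around it,
# then trim whatever is left at the ends.
conditions <- trimws(gsub("\\s*\n\\s*", "", conditions))

# The cleaned strings can then serve as factor levels.
condition_factor <- factor(conditions, levels = conditions)
print(levels(condition_factor))
```

Matching the whitespace around each newline, rather than removing pairs of spaces, keeps the single spaces inside a condition name intact and does not depend on the indentation being an even number of characters.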
