rvest: Return NAs for empty nodes given multiple listings

Problem description

I am fairly new to R (and using it for web scraping in particular), so any help is greatly appreciated. I am currently trying to mine a webpage that contains multiple ticket listings and lists additional details for some of these (like the ticket having an impaired view or being for children only). I want to extract this data, leaving blank spaces or NAs for the ticket listings that do not contain these details.

Since the original website requires the use of RSelenium, I have tried to replicate the HTML in a simpler form. If any information is missing, please let me know and I will try to provide it. Thanks!

So far, I have tried to adopt the solutions provided in "rvest missing nodes --> NA" and "htmlParse missing values NA", but I am not able to replicate them for my example, as I obtain the error message:

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"

I guess I do need a combination of rvest and lapply, but I do not seem to be able to make it work.

library(XML)
library(rvest)

html <- '<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div class = "listing" id = "listing_1">
<em> 
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div class = "listing" id = "listing_2">
<em> 
<span class="listing_sub2">
Other text I am not interested in
</span>
</em>
</div>
<div class = "listing" id = "listing_3">
<div>
<em> 
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div>
<span class="listing_sub1">
Ticket for a child
</span>
</div>
</div>
</body>
</html>'


page_html <- read_html(html)
child <- html_nodes(page_html, xpath ="//*[@class='listing_sub1']") %>%
  html_text()
viewLim <- html_nodes(page_html, xpath ="//*[@class='listing_sub3']") %>%
  html_text()
id <- html_nodes(page_html, xpath = "//*[@class='listing']") %>%
  html_attr(name = "id")

I hope to obtain a table that looks similar to this:

listing  child   viewLim
1        F       T       
2        F       F      
3        T       T  

Solution

The strategy in this solution is to collect the listing nodes first and then search within each of those nodes for the desired information: the child-ticket and limited-view spans.

Using html_node instead of html_nodes always returns exactly one value per listing (even if it is just NA), which ensures that the vectors all have the same length.
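To make the length difference concrete, here is a minimal sketch on a made-up two-listing page (the `span.note` class and the ids `a` and `b` are assumptions for illustration, not part of the question's HTML). It compares what html_nodes and html_node return when only one of the two listings contains a match:

```r
library(rvest)

# Hypothetical page: only the first listing contains a span.note
doc <- read_html('<div class="listing" id="a"><span class="note">one</span></div>
<div class="listing" id="b"></div>')
listings <- html_nodes(doc, "div.listing")

# html_nodes() collects only the matches, so the empty listing drops out
all_matches <- html_nodes(listings, "span.note") %>% html_text()

# html_node() gives one result per listing; html_text() turns a miss into NA
per_listing <- html_node(listings, "span.note") %>% html_text()

length(all_matches)
length(per_listing)
is.na(per_listing)
```

Because `per_listing` keeps one slot per listing, it can be placed next to the ids in a data frame without any realignment.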

Also, with rvest I prefer the CSS selector syntax over XPath. In most cases, CSS selectors are easier to write than XPath expressions.
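As a rough comparison, the sketch below selects the same span once with a CSS selector and once with XPath. The one-listing HTML string is a cut-down assumption based on the question's markup.

```r
library(rvest)

# Cut-down, hypothetical version of one listing from the question
doc <- read_html('<div class="listing"><span class="listing_sub3">Limited view</span></div>')

# CSS selector: element name plus class
by_css <- html_nodes(doc, "span.listing_sub3") %>% html_text()

# XPath: the class attribute must equal the string exactly
by_xpath <- html_nodes(doc, xpath = "//span[@class='listing_sub3']") %>% html_text()

identical(by_css, by_xpath)
```

One practical difference is that the CSS form still matches when an element carries several classes, whereas `@class='listing_sub3'` only matches when the attribute is exactly that string.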

library(rvest)

page_html <- read_html(html)
# Find the listing nodes and the id of each node
listings <- html_nodes(page_html, "div.listing")
listing <- html_attr(listings, name = "id")

# Search each listing node for the child-ticket and limited-view spans;
# html_text() returns NA when html_node() finds no match in a listing
child <- sapply(listings, function(x) html_node(x, "span.listing_sub1") %>% html_text())
viewLim <- sapply(listings, function(x) html_node(x, "span.listing_sub3") %>% html_text())

# Build the data frame: TRUE where the span was found, FALSE otherwise
df <- data.frame(listing, child = !is.na(child), viewLim = !is.na(viewLim))

# df
#    listing child viewLim
#1 listing_1 FALSE    TRUE
#2 listing_2 FALSE   FALSE
#3 listing_3  TRUE    TRUE
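As a possible variation, not part of the original answer, html_node can also be applied to the whole node set at once, which avoids the sapply calls. The sketch below assumes a condensed copy of the question's HTML with the same classes and ids:

```r
library(rvest)

# Condensed copy of the question's sample HTML (same classes and ids)
html <- '<div class="listing" id="listing_1"><em><span class="listing_sub3">Limited view</span></em></div>
<div class="listing" id="listing_2"><em><span class="listing_sub2">Other text</span></em></div>
<div class="listing" id="listing_3"><div><em><span class="listing_sub3">Limited view</span></em></div><div><span class="listing_sub1">Ticket for a child</span></div></div>'

listings <- html_nodes(read_html(html), "div.listing")

# html_node() on a node set returns one (possibly missing) node per listing
df <- data.frame(
  listing = html_attr(listings, "id"),
  child   = !is.na(html_text(html_node(listings, "span.listing_sub1"))),
  viewLim = !is.na(html_text(html_node(listings, "span.listing_sub3")))
)
df
```

In rvest 1.0.0 and later, html_element and html_elements are the newer names for html_node and html_nodes. The older names used here still work.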
