rvest: Return NAs for empty nodes given multiple listings

Views: 45 · Published: 2021/7/14 18:34:52 · Tags: r, web-scraping, rvest

Problem description

I am fairly new to R (and using it for web scraping in particular), so any help is greatly appreciated. I am currently trying to mine a webpage that contains multiple ticket listings and lists additional details for some of these (like the ticket having an impaired view or being for children only). I want to extract this data, leaving blank spaces or NAs for the ticket listings that do not contain these details.

Since the original website requires the use of RSelenium, I have tried to replicate the HTML in a simpler form. If any information is missing, please let me know and I will try to provide it. Thanks!

So far, I have tried to adapt the solutions provided here: "rvest missing nodes --> NA" and "htmlParse missing values NA", but I am not able to replicate them for my example, as I get the error message:

Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"

I guess I do need a combination of rvest and lapply, but I do not seem to be able to make it work.

library(XML)
library(rvest)

html <- '<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div class = "listing" id = "listing_1">
<em> 
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div class = "listing" id = "listing_2">
<em> 
<span class="listing_sub2">
Other text I am not interested in
</span>
</em>
</div>
<div class = "listing" id = "listing_3">
<div>
<em> 
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div>
<span class="listing_sub1">
Ticket for a child
</span>
</div>
</div>
</body>
</html>'


page_html <- read_html(html)
child <- html_nodes(page_html, xpath ="//*[@class='listing_sub1']") %>%
  html_text()
viewLim <- html_nodes(page_html, xpath ="//*[@class='listing_sub3']") %>%
  html_text()
id <- html_nodes(page_html, xpath = "//*[@class='listing']") %>%
  html_attr(name = "id")

I hope to obtain a table that looks similar to this:

listing  child   viewLim
1        F       T       
2        F       F      
3        T       T

Solution

The strategy in this solution is to first collect the listing nodes and then search inside each of them for the desired information: the child ticket and the limited view.

Using html_node instead of html_nodes always returns exactly one value per listing (NA when nothing matches), which ensures the vectors all have the same length.
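To illustrate the difference, here is a minimal sketch. The two-listing page and the variable names (doc, all_matches, per_listing) are made up for this example and are not part of the question's HTML.

```r
library(rvest)

# Hypothetical two-listing page: only the second listing has a child ticket.
doc <- read_html('
<div class="listing" id="a"><span class="listing_sub3">Limited view</span></div>
<div class="listing" id="b"><span class="listing_sub1">Ticket for a child</span></div>')

listings <- html_nodes(doc, "div.listing")

# html_nodes() returns only the matches it finds, so listings without
# a child ticket simply disappear from the result.
all_matches <- html_text(html_nodes(listings, "span.listing_sub1"))
length(all_matches)   # 1

# html_node() keeps one slot per listing; html_text() turns a missing
# node into NA.
per_listing <- html_text(html_node(listings, "span.listing_sub1"))
length(per_listing)   # 2
is.na(per_listing)    # TRUE FALSE
```

Because per_listing has one element per listing, it can be placed next to the listing ids in a data frame without any length mismatch.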

Also, with rvest I prefer CSS selectors to XPath. In most cases CSS selectors are easier to write and read than the equivalent XPath expressions.
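One practical difference, sketched below under the assumption that a listing may carry more than one class: the CSS selector div.listing matches any div whose class list contains listing, while the XPath test @class='listing' requires the whole attribute to equal that string. The class name featured and the variables css_ids and xpath_ids are hypothetical.

```r
library(rvest)

# Hypothetical page: the second div has an extra class.
doc <- read_html('
<div class="listing" id="listing_1"></div>
<div class="listing featured" id="listing_2"></div>')

# CSS: matches both divs, because "listing" is one of the classes.
css_ids <- html_attr(html_nodes(doc, "div.listing"), "id")

# XPath: exact comparison of the class attribute, so only the first div matches.
xpath_ids <- html_attr(html_nodes(doc, xpath = "//div[@class='listing']"), "id")

css_ids     # "listing_1" "listing_2"
xpath_ids   # "listing_1"
```

For the HTML in the question both forms select the same nodes, since every element there has a single class.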

library(rvest)

page_html <- read_html(html)
# find the listing nodes and the id of each node
listings <- html_nodes(page_html, "div.listing")
listing <- html_attr(listings, name = "id")

# search each listing node for the child ticket and limited view criteria
child <- sapply(listings, function(x) {html_node(x, "span.listing_sub1") %>% html_text()})
viewLim <- sapply(listings, function(x) {html_node(x, "span.listing_sub3") %>% html_text()})

# create the data frame
df <- data.frame(listing, child = !is.na(child), viewLim = !is.na(viewLim))

# df
#    listing child viewLim
#1 listing_1 FALSE    TRUE
#2 listing_2 FALSE   FALSE
#3 listing_3  TRUE    TRUE
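
If the text itself is wanted, with NA for listings that lack a detail, a variation along the following lines may work. This is a sketch, not part of the original answer: it relies on html_node accepting the whole node set at once (so sapply is not needed), uses a condensed copy of the question's HTML, and the name df_text is made up.

```r
library(rvest)

# Condensed copy of the question's HTML (same classes and ids).
html <- '<div class="listing" id="listing_1"><em><span class="listing_sub3"> Limited view </span></em></div>
<div class="listing" id="listing_2"><em><span class="listing_sub2">Other text</span></em></div>
<div class="listing" id="listing_3">
  <div><em><span class="listing_sub3"> Limited view </span></em></div>
  <div><span class="listing_sub1"> Ticket for a child </span></div>
</div>'

listings <- html_nodes(read_html(html), "div.listing")

# html_node() returns one node (or a missing node) per listing;
# trimws() strips surrounding whitespace and leaves NA untouched.
df_text <- data.frame(
  listing = html_attr(listings, "id"),
  child   = trimws(html_text(html_node(listings, "span.listing_sub1"))),
  viewLim = trimws(html_text(html_node(listings, "span.listing_sub3"))),
  stringsAsFactors = FALSE
)
df_text
```

Applying !is.na() to the child and viewLim columns gives the same TRUE/FALSE table as above.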
