rvest: Return NAs for empty nodes given multiple listings
Problem description
I am fairly new to R (and using it for web scraping in particular), so any help is greatly appreciated. I am currently trying to mine a webpage that contains multiple ticket listings and lists additional details for some of these (like the ticket having an impaired view or being for children only). I want to extract this data, leaving blank spaces or NAs for the ticket listings that do not contain these details.
Since the original website requires the use of RSelenium, I have tried to replicate the HTML in a simpler form. If any information is missing, please let me know and I will try to provide it. Thanks!
So far, I have tried to adapt the solutions provided here: "rvest missing nodes --> NA" and "htmlParse missing values NA", but I am not able to replicate them for my example, as I obtain the error message:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"
I guess I do need a combination of rvest and lapply, but I do not seem to be able to make it work.
library(XML)
library(rvest)
html <- '<!DOCTYPE html>
<html>
<head>
</head>
<body>
<div class = "listing" id = "listing_1">
<em>
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div class = "listing" id = "listing_2">
<em>
<span class="listing_sub2">
Other text I am not interested in
</span>
</em>
</div>
<div class = "listing" id = "listing_3">
<div>
<em>
<span class="listing_sub3">
Limited view
</span>
</em>
</div>
<div>
<span class="listing_sub1">
Ticket for a child
</span>
</div>
</div>
</body>
</html>'
page_html <- read_html(html)
child <- html_nodes(page_html, xpath = "//*[@class='listing_sub1']") %>%
  html_text()
viewLim <- html_nodes(page_html, xpath = "//*[@class='listing_sub3']") %>%
  html_text()
id <- html_nodes(page_html, xpath = "//*[@class='listing']") %>%
  html_attr(name = "id")
I hope to obtain a table that looks similar to this:
listing child viewLim
1 F T
2 F F
3 T T
The strategy in this solution is to first collect the listing nodes and then search within each of those nodes for the desired information: the child-ticket and limited-view markers.
Using html_node instead of html_nodes always returns exactly one value per listing (NA if nothing matches), which ensures that the vectors all have the same length.
Also, with rvest I prefer to use CSS selectors instead of XPath. In most cases, CSS selectors are easier to use than XPath expressions.
library(rvest)
page_html <- read_html(html)
# find the listing nodes and the id of each node
listings <- html_nodes(page_html, "div.listing")
listing <- html_attr(listings, name = "id")
# search each listing node for the child-ticket and limited-view criteria
child <- sapply(listings, function(x) {html_node(x, "span.listing_sub1") %>% html_text()})
viewLim <- sapply(listings, function(x) {html_node(x, "span.listing_sub3") %>% html_text()})
# create the data frame
df <- data.frame(listing, child = !is.na(child), viewLim = !is.na(viewLim))
# df
# listing child viewLim
#1 listing_1 FALSE TRUE
#2 listing_2 FALSE FALSE
#3 listing_3 TRUE TRUE
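As a possible variation on the approach above, html_node can also be applied to the whole set of listing nodes at once, since it returns one result per input node and html_text turns a missing match into NA. The sketch below is self-contained and uses a condensed copy of the example HTML; the names child_text and view_text are only illustrative. In rvest 1.0 and later the same functions are also available as html_elements and html_element.

```r
library(rvest)

# Condensed copy of the example page (same classes and ids as in the question)
html <- '<html><body>
<div class="listing" id="listing_1"><em><span class="listing_sub3">Limited view</span></em></div>
<div class="listing" id="listing_2"><em><span class="listing_sub2">Other text</span></em></div>
<div class="listing" id="listing_3">
  <div><em><span class="listing_sub3">Limited view</span></em></div>
  <div><span class="listing_sub1">Ticket for a child</span></div>
</div>
</body></html>'

page_html <- read_html(html)
listings <- html_nodes(page_html, "div.listing")

# html_nodes() on the whole page only returns the matches it finds,
# so the result is shorter than the number of listings
n_view_matches <- length(html_nodes(page_html, "span.listing_sub3"))

# html_node() on the listing set returns one entry per listing;
# html_text() gives NA where a listing has no matching span
child_text <- html_text(html_node(listings, "span.listing_sub1"))
view_text <- html_text(html_node(listings, "span.listing_sub3"))

df <- data.frame(
  listing = html_attr(listings, "id"),
  child = !is.na(child_text),
  viewLim = !is.na(view_text),
  stringsAsFactors = FALSE
)
print(df)
```

Because every vector has one element per listing, data.frame can combine them directly without an explicit sapply loop.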