R: web scraping returns no applicable method for 'xml_find_all' applied to an object of class "character"?
I'm using RSelenium and purrr functions to generate a data frame with all the products on this page and their prices:
https://www.lacuracao.pe/curacao/tv-y-audio/televisores
Why am I getting this error?
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Code:
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
# start RSelenium
rD <- rsDriver(port = 4560L, browser = "chrome", version = "3.141.59", chromever = "93.0.4577.63",
               geckover = "latest", iedrver = NULL, phantomver = "2.1.1",
               verbose = TRUE, check = TRUE)
remDr <- rD[["client"]]
Sys.sleep(10)
tvs_url <- "https://www.lacuracao.pe/curacao/tv-y-audio/televisores"
remDr$navigate(tvs_url)
Sys.sleep(10)
# scroll down 20 times, waiting for the page to load each time
for (i in 1:20) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(5)
}
h <- remDr$getPageSource()
df <- map_dfr(h %>%
                map(~ .x %>%
                      html_nodes("div.product")), ~
                data.frame(
                  periodo = lubridate::year(Sys.Date()),
                  fecha = Sys.Date(),
                  ecommerce = "lacuracao",
                  producto = .x %>% html_node(".product_name") %>% html_text(),
                  precio.antes = .x %>% html_node('.old-price') %>% html_text(),
                  precio.actual = .x %>% html_node('#offerPriceValue') %>% html_text()
                ))
Update 1:
I've changed h <- remDr$getPageSource() to h <- remDr$getPageSource()[[1]], and now class(h) returns "character".
Update 2:
Tried:
h<-remDr$getPageSource()[[1]]
hh <- h %>% read_html() %>% html_elements("div.product")
class(hh) #[1] "xml_nodeset"
But getting this when trying to form the df:
Error in data.frame(periodo = lubridate::year(Sys.Date()), fecha = Sys.Date(), :
arguments imply differing number of rows: 1, 0
Use remDr$getPageSource()[[1]] to get the actual document; getPageSource() returns a list, and its first element is the page HTML as a character string.
You then need to pipe that to your DOM parser, i.e. remDr$getPageSource()[[1]] %>% read_html(), and continue on as before, i.e. ... %>% html_elements(.....).
RSelenium has its own methods for selecting elements via the WebDriver instance, e.g. remDr$findElement("css", "body"). In your case, you are choosing to transfer the HTML across into something on which you can call rvest's html_nodes(), i.e. either a document, a node set or a single node. As what is transferred is an HTML string, read_html() is needed to generate a document for parsing.
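As a rough illustration of the difference between the raw string and the parsed document, here is a minimal sketch that needs no running browser. The page_source list and its HTML are made up for the example and only stand in for what remDr$getPageSource() returns; the real page's markup may differ.

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource(): a list whose first
# element is the page HTML as a single character string.
page_source <- list(
  "<html><body>
     <div class='product'><span class='product_name'>TV A</span></div>
     <div class='product'><span class='product_name'>TV B</span></div>
   </body></html>"
)

h <- page_source[[1]]
class(h)  # a plain character vector, not a parsed document

# Calling a selector on the raw string fails; capture the message
# instead of stopping the script.
err <- tryCatch(
  html_elements(h, "div.product"),
  error = function(e) conditionMessage(e)
)

# Parsing first gives a document that the rvest selectors understand.
doc <- read_html(h)
products <- html_elements(doc, "div.product")
class(products)
length(products)
```

With a live session, the same pipeline would start from remDr$getPageSource()[[1]] instead of the made-up page_source.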
The error in the data.frame() call occurs because you need to handle missing child nodes, i.e. products where certain prices are missing.
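One possible way to handle the missing nodes is sketched below. It replaces the per-node map_dfr() loop from the question with a vectorized call, relying on the fact that html_element() (singular) returns one result per input node and yields NA where nothing matches, whereas html_elements() (plural) silently drops the non-matches. The HTML is invented for the example and assumes the same class names and id as the question's selectors.

```r
library(rvest)

# Made-up markup: the second product has no old price.
html <- "<html><body>
  <div class='product'>
    <span class='product_name'>TV A</span>
    <span class='old-price'>S/ 1,999</span>
    <span id='offerPriceValue'>S/ 1,499</span>
  </div>
  <div class='product'>
    <span class='product_name'>TV B</span>
    <span id='offerPriceValue'>S/ 899</span>
  </div>
</body></html>"

products <- read_html(html) %>% html_elements("div.product")

# html_elements() returns only the matches, so the lengths can disagree
# with the number of products.
old_prices_found <- products %>% html_elements(".old-price") %>% html_text()
length(old_prices_found)

# html_element() keeps one entry per product and gives NA where the
# child node is missing, so every column has the same number of rows.
df <- data.frame(
  periodo       = lubridate::year(Sys.Date()),
  fecha         = Sys.Date(),
  ecommerce     = "lacuracao",
  producto      = products %>% html_element(".product_name") %>% html_text(trim = TRUE),
  precio.antes  = products %>% html_element(".old-price") %>% html_text(trim = TRUE),
  precio.actual = products %>% html_element("#offerPriceValue") %>% html_text(trim = TRUE)
)
df
```

Rows with a missing price then carry NA in that column and can be filtered or filled in afterwards.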