R: web scraping returns no applicable method for 'xml_find_all' applied to an object of class "character"?


Problem description


I'm using RSelenium and purrr functions to build a data frame of all the products on this page and their prices:

https://www.lacuracao.pe/curacao/tv-y-audio/televisores

I'm getting the following error. Why?

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"

Code:

library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
library(purrr)


#start RSelenium


rD  <- rsDriver(port = 4560L, browser = "chrome", version = "3.141.59", chromever = "93.0.4577.63",
                geckover = "latest", iedrver = NULL, phantomver = "2.1.1",
                verbose = TRUE, check = TRUE)



remDr <- rD[["client"]]


Sys.sleep(10)

tvs_url <- "https://www.lacuracao.pe/curacao/tv-y-audio/televisores"

remDr$navigate(tvs_url)

Sys.sleep(10)

#scroll down 20 times, waiting for the page to load at each time
for(i in 1:20){      
  remDr$executeScript(paste("scroll(0,",i*10000,");"))
  Sys.sleep(5)    
}


h<-remDr$getPageSource()



df <- map_dfr(h %>%
                map(~ .x %>%
                      html_nodes("div.product")), ~
                data.frame(
                  periodo = lubridate::year(Sys.Date()),
                  fecha = Sys.Date(),
                  ecommerce = "lacuracao",
                  producto = .x %>% html_node(".product_name") %>% html_text(),
                  precio.antes = .x %>% html_node('.old-price') %>% html_text(),
                  precio.actual = .x %>% html_node('#offerPriceValue') %>% html_text()
                ))

Update 1:

I've changed h<-remDr$getPageSource() to h<-remDr$getPageSource()[[1]] and now class(h) returns character.

Update 2:

Tried:

h<-remDr$getPageSource()[[1]]

hh <- h %>% read_html() %>% html_elements("div.product")

class(hh) #[1] "xml_nodeset"

But I get this error when trying to build the data frame:

Error in data.frame(periodo = lubridate::year(Sys.Date()), fecha = Sys.Date(),  : 
  arguments imply differing number of rows: 1, 0

Solution

Use remDr$getPageSource()[[1]] to get the actual document: getPageSource() returns a list, and [[1]] extracts the page source from it as a character string.

You then need to pipe that string to your DOM parser, i.e. remDr$getPageSource()[[1]] %>% read_html(), and continue as before, i.e. ... %>% html_elements(.....).

RSelenium has its own methods for selecting elements via the WebDriver instance, e.g. remDr$findElement("css", "body"). In your case, you are choosing to transfer the HTML into something you can call rvest's html_nodes() on, i.e. a document, a node set, or a single node. Because what is transferred is HTML text, read_html() is needed to turn it into a parsed document.
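
A minimal sketch of why the parsing step matters, assuming rvest 1.0 or later and a small hand-written HTML string standing in for the value returned by remDr$getPageSource() (the markup and the page_source name are illustrative, not taken from the real page). Calling html_elements() on the raw string should fail with a method-dispatch error like the one in the question, while parsing it first yields a node set:

```r
library(rvest)

# Hypothetical stand-in for remDr$getPageSource(): a list holding one HTML string.
page_source <- list(
  "<html><body><div class='product'>TV A</div><div class='product'>TV B</div></body></html>"
)

# [[1]] extracts the character string from the list.
h <- page_source[[1]]
class(h)

# Selecting on the raw string fails because it is not a parsed document;
# the error is captured here instead of stopping the script.
err <- tryCatch(html_elements(h, "div.product"), error = function(e) e)
inherits(err, "error")

# Parse first, then select.
doc <- read_html(h)
hh <- html_elements(doc, "div.product")
class(hh)
length(hh)
```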

The error in the data.frame() call occurs because you need to handle missing child nodes, i.e. products where certain prices are missing.
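
One possible way to handle the missing prices, sketched on an invented two-product snippet (the HTML below is an assumption that only reuses the selectors from the question, and rvest 1.0 or later is assumed). Applied to a node set, html_element() returns one result per product, with a missing node where nothing matches, so html_text() gives NA rather than a zero-length vector and every column keeps the same number of rows:

```r
library(rvest)

# Invented markup: the second product deliberately has no .old-price node.
html <- paste0(
  '<div class="product">',
  '<span class="product_name">TV A</span>',
  '<span class="old-price">S/ 1999</span>',
  '<span id="offerPriceValue">S/ 1499</span>',
  '</div>',
  '<div class="product">',
  '<span class="product_name">TV B</span>',
  '<span id="offerPriceValue">S/ 899</span>',
  '</div>'
)

products <- read_html(html) %>% html_elements("div.product")

# html_element() (singular) keeps one entry per product, so all columns align.
df <- data.frame(
  producto      = products %>% html_element(".product_name") %>% html_text(trim = TRUE),
  precio.antes  = products %>% html_element(".old-price") %>% html_text(trim = TRUE),
  precio.actual = products %>% html_element("#offerPriceValue") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

df
```

In this sketch the second product ends up with NA in precio.antes. Because the extraction is vectorised over the node set, it may also make the map_dfr() wrapper unnecessary.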
