在R中使用React JS抓取网页 [英] Scraping webpage with react JS in R
本文介绍了在R中使用React JS抓取网页的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我正在尝试抓取以下页面:
https://metro.zakaz.ua/uk/?promotion=1
此页面包含反应内容.
我可以用代码抓取第一页:
I'm trying to scrape page below :
https://metro.zakaz.ua/uk/?promotion=1
This page with react content.
I can scrape first page with code:
url="https://metro.zakaz.ua/uk/?promotion=1"
read_html(url)%>%
html_nodes("script")%>%
.[[8]] %>%
html_text()%>%
fromJSON()%>%
.$catalog%>%.$items%>%
data.frame
结果是我拥有第一页中的所有项目,但我不知道如何抓取其他页面.
如果可以,此js代码将移至其他页面:
In result I have all items from first page, but I don't know how to scrape others pages.
This js code move to other page if that can help:
document.querySelectorAll('.catalog-pagination')[0].children[1].children[0].click()
感谢您的帮助!
推荐答案
您将需要'RSelenum'来执行无头导航.
You will need 'RSelenum' to perform headless navigation.
检查设置:如何为R设置硒? /a>
Check out for setting up: How to set up rselenium for R?
library(RSelenium)
library(rvest)
library(tidyvers)
url="https://metro.zakaz.ua/uk/?promotion=1"
rD <- rsDriver(port=4444L, browser="chrome")
remDr <- rD[['client']]
remDr$navigate(url)
### adjust items you want to scrape
src <- remDr$getPageSource()[[1]]
pg <- read_html(src)
tbl <- tibble(
product_name = pg %>% html_nodes(".product-card-name") %>% html_text(),
product_info = pg %>% html_nodes(".product-card-info") %>% html_text()
)
## to handle pagenation (tested with 5 pages) - adjust accordinly
for (i in 2:5) {
pages <- remDr$findElement(using = 'css selector',str_c(".page:nth-child(",i,")"))
pages$clickElement()
## wait 5 sec to load
Sys.sleep(5)
src <- remDr$getPageSource()[[1]]
pg <- read_html(src)
data <- tibble(
product_name = pg %>% html_nodes(".product-card-name") %>% html_text(),
product_info = pg %>% html_nodes(".product-card-info") %>% html_text()
)
tbl <- tbl %>% bind_rows(data)
}
nrow(tbl)
head(tbl)
tail(tbl)
这是一个快速的输出:
这篇关于在R中使用React JS抓取网页的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文