Scraping a dynamic ecommerce page with infinite scroll

Problem description

I'm using rvest in R to do some scraping. I know some HTML and CSS.

I want to get the price of every product listed at this URL:

http://www.linio.com.co/tecnologia/celulares-telefonia-gps/

New items load as you scroll down the page.

What I have done so far:

library(rvest)

# read_html() replaces html(), which is deprecated in current rvest
Linio_Celulares <- read_html("http://www.linio.com.co/celulares-telefonia-gps/")

Linio_Celulares %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()

This gives me what I need, but only for the first 25 elements (the ones loaded by default).

 [1] "$ 1.999.900" "$ 1.999.900" "$ 1.999.900" "$ 2.299.900" "$ 2.279.900"
 [6] "$ 2.279.900" "$ 1.159.900" "$ 1.749.900" "$ 1.879.900" "$ 189.900"  
[11] "$ 2.299.900" "$ 2.499.900" "$ 2.499.900" "$ 2.799.000" "$ 529.900"  
[16] "$ 2.699.900" "$ 2.149.900" "$ 189.900"   "$ 2.549.900" "$ 1.395.900"
[21] "$ 249.900"   "$ 41.900"    "$ 319.900"   "$ 149.900" 

Question: How can I get all the elements of this dynamic section?

I guess I could scroll the page manually until all elements are loaded and then use html(URL). But that seems like a lot of work, since I'm planning to do this for several different sections. There should be a programmatic workaround.

Recommended answer

As @nrussell suggested, you can use RSelenium to programmatically scroll down the page before getting the source code.

For example, you could do:

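library(RSelenium)
library(rvest)

# Start a Selenium server and open a browser session.
# checkForServer() and startServer() are defunct in recent RSelenium releases;
# rsDriver() replaces them (the browser choice here is an assumption).
driver <- rsDriver(browser = "firefox")
remDr <- driver$client

# Navigate to the page
remDr$navigate("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/")

# Scroll down 5 times, pausing after each scroll so new items can load
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# Get the rendered page HTML
page_source <- remDr$getPageSource()

# Parse it (read_html() replaces the deprecated html())
read_html(page_source[[1]]) %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()

# Close the browser and stop the server when finished
remDr$close()
driver$server$stop()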
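
Since the question mentions repeating this on several sections, a fixed count of five scrolls may be too few or too many depending on the page. One possible variation, shown here only as a sketch, keeps scrolling until the document height stops growing. The helper name scroll_to_bottom and its pause and max_scrolls parameters are made up for illustration. It assumes remDr is an already open RSelenium session, that executeScript() returns a list whose first element is the script's return value, and that the page exposes its full height through document.body.scrollHeight.

```r
# Hypothetical helper (not part of RSelenium): scroll until the page height
# stops increasing, or until max_scrolls is reached.
# `remDr` is assumed to be an open RSelenium remoteDriver session.
scroll_to_bottom <- function(remDr, pause = 3, max_scrolls = 50) {
  # Ask the browser for the current document height
  get_height <- function() {
    remDr$executeScript("return document.body.scrollHeight;")[[1]]
  }

  last_height <- get_height()
  scrolls <- 0

  while (scrolls < max_scrolls) {
    # Jump to the current bottom of the page to trigger the next batch
    remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
    scrolls <- scrolls + 1
    Sys.sleep(pause)  # give the page time to load new items

    new_height <- get_height()
    if (new_height <= last_height) break  # no growth: assume nothing is left
    last_height <- new_height
  }

  invisible(scrolls)  # number of scrolls performed
}
```

After calling scroll_to_bottom(remDr), the page source can be fetched and parsed exactly as in the code above. A pause that is too short can make the loop stop early on a slow connection, so the value of 3 seconds is a guess carried over from the original loop rather than a tuned setting.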
