How to scrape all contents from an infinite scroll website with scrapy?

Problem Description

I'm using Scrapy.

The website I'm using has infinite scroll.

The website has loads of posts, but I only scraped 13.

How can I scrape the rest of the posts?

Here is my code:

import scrapy


class exampleSpider(scrapy.Spider):
    name = "example"
    #from_date = datetime.date.today() - datetime.timedelta(6*365/12)
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/somethinghere/"
    ]

    def parse(self, response):
        for href in response.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]/ul/li/div/h1/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # scrape contents code here
        pass

Recommended Answer

Check the website's code.

If the infinite scroll automatically triggers a js action, you could proceed as follows, using the Alioth proposal: spynner.

Following the spynner docs, you can find that it can trigger jQuery events.

Look up the library code to see which kinds of events you can fire.

Try to generate a scroll-to-bottom event, or create a CSS property change on one of the divs inside the scrollable content of the website. Following the spynner docs, something like:

browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
# load your website here, as spynner allows
browser.load_jquery(True)
ret = run_debug(browser.runjs, 'window.scrollTo(0, document.body.scrollHeight); console.log("scrolling...");')
# continue parsing ret
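
For an infinite feed, a single scroll is usually not enough: the event has to be fired repeatedly until no new posts arrive. A minimal sketch of that loop, assuming the spynner API shown above (load, load_jquery, runjs, wait, html) and using the size of the rendered html as a crude end-of-feed check; the URL and the fixed 2-second wait are placeholders to tune for the target site:

import spynner

browser = spynner.Browser()
browser.load("http://www.example.com/somethinghere/")  # URL from the question
browser.load_jquery(True)

previous_length = 0
while True:
    # scroll to the bottom so the page's js fetches the next batch of posts
    browser.runjs("window.scrollTo(0, document.body.scrollHeight);")
    browser.wait(2)  # crude fixed wait; tune for the target site
    if len(browser.html) == previous_length:
        break  # nothing new was appended, assume the feed is exhausted
    previous_length = len(browser.html)

html = browser.html  # fully rendered page, ready for parsing
browser.close()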

It is not very likely that an infinite scroll is triggered by an anchor link, but it may be triggered by a jQuery action that is not necessarily attached to a link. For that case, use code like the following:

br.load('http://pypi.python.org/pypi')

anchors = br.webframe.findAllElements('#menu ul.level-two a')
# choose the anchor with the word "Browse" as key
anchor = [a for a in anchors if 'Browse' in a.toPlainText()][0]
br.wk_click_element_link(anchor, timeout=10)
output = br.show()
# save the output to a file (output.html), or plug these actions into your
# scrapy method and parse the output var as you do with the response body

Then, run scrapy on the output.html file or, if you implemented it that way, use the local in-memory variable you chose to store the modified html after the js action.
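
A small sketch of that last step: scrapy's Selector can parse an in-memory string directly, so the rendered html (the output variable above, or browser.html) can be fed straight into the same XPath as in the question:

from scrapy import Selector

sel = Selector(text=output)  # output = the html captured after the js actions ran
for href in sel.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]/ul/li/div/h1/a/@href").extract():
    print(href)  # or build scrapy.Request objects from these, as in the spider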

As another solution, the website you are trying to parse might have an alternate rendered version for the case where the visitor's browser does not have js activated.

Try rendering the website with a javascript-disabled browser; that way, the website may make an anchor link available at the end of the content section.
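
If such a link exists, the spider from the question can follow it and recurse into parse, so that pagination replaces the scroll. A sketch, where the rel="next" selector is a hypothetical example of what the no-js markup might expose:

def parse(self, response):
    for href in response.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]/ul/li/div/h1/a/@href"):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
    # hypothetical "next page" anchor exposed by the no-js rendering
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)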

There are also successful implementations of crawler js navigation that use Scrapy together with Selenium, as detailed in this SO answer.
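
The usual shape of that approach is a Selenium scroll loop that stops once the page height stops growing, after which the page source is handed back to scrapy for extraction. A sketch using the standard Selenium API; the driver choice and the fixed 2-second wait are assumptions:

import time
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome(); needs the matching driver binary
driver.get("http://www.example.com/somethinghere/")

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # let the next batch of posts load; tune for the site
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height stopped growing: assume no more posts
    last_height = new_height

html = driver.page_source  # parse with scrapy's Selector(text=html) as above
driver.quit()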
