Scraping Infinite Scrolling Pages with "load more" button using Scrapy


Question

How do you scrape a web page with infinite scrolling where the response is html/text instead of json?

My first try was using Rule and LinkExtractor, which gets me around 80% of the job URLs:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JobsetSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['jobs.et']
    start_urls = ['https://jobs.et/jobs/']

    rules = (
        Rule(LinkExtractor(allow=r'https://jobs.et/job/\d+/'), callback='parse_link'),
        Rule(LinkExtractor(), follow=True),
    )

    def parse_link(self, response):
        yield {
            'url': response.url
        }

My second attempt was to use the example from SCRAPING INFINITE SCROLLING PAGES, but the response is text/html, not json.

When the "load more" button is clicked, I can see the request URL in the Network tab of Chrome Developer Tools:

https://jobs.et/jobs/?searchId=1509738711.5142&action=search&page=2

where the "page" number increases.

My questions are:

  1. When using Scrapy, how do I extract the URL above from the response after the "load more" button is clicked?
  2. Is there a better way to approach this problem?

Answer

Ignore the "load more" button.

You can access all the pages of jobs using URLs, as you mention. When you parse the first page of results, find the total number of jobs from the header element:

<h1 class="search-results__title ">
268 jobs found
</h1>

The site displays 20 jobs per page, so you need to scrape 268/20 = 13.4 (rounded up to 14) pages.
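
For example, the count can be pulled out of the quoted h1 with an XPath and a regex; a minimal sketch (the selector is an assumption based on the markup shown above):

import math

# assumes the quoted <h1 class="search-results__title"> markup
job_count = int(response.xpath(
    '//h1[contains(@class, "search-results__title")]/text()'
).re_first(r'(\d+)'))
pages = math.ceil(job_count / 20.0)   # 268 jobs -> 14 pages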

When you finish parsing the first page, create a generator to yield URLs for the subsequent pages (in a loop up to 14) and parse the result with another function. You will need the searchId, which you can't get from the URL, but it's in a hidden field on the page:

<input type="hidden" name="searchId" value="1509738711.5142">
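
That hidden field can be read with a straightforward XPath; a sketch, assuming the input markup shown above:

# assumes the hidden <input name="searchId"> shown above
search_id = response.xpath('//input[@name="searchId"]/@value').get()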

Using that and the page number, you can build your URLs:

https://jobs.et/jobs/?searchId=<id>&action=search&page=<page>
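
If you would rather not assemble the query string by hand, the standard library's urlencode builds the same URL; a small sketch (build_page_url is a hypothetical helper name):

from urllib.parse import urlencode

def build_page_url(search_id, page):
    # produces https://jobs.et/jobs/?searchId=<id>&action=search&page=<page>
    return 'https://jobs.et/jobs/?' + urlencode(
        {'searchId': search_id, 'action': 'search', 'page': page})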

Yes, the parse function will be doing exactly the same thing as your first-page parser, but while you get it all working it's good to live with the code duplication to keep things straight in your head.

The code might look something like this:

import math

from scrapy import Request
from scrapy.spiders import CrawlSpider


class JobsetSpider(CrawlSpider):
    ...
    start_urls = ['https://jobs.et/jobs/']
    ...

    def parse(self, response):
        # parse the first page of jobs
        ...
        job_count = int(response.xpath(
            '//h1[contains(@class, "search-results__title")]/text()'
        ).re_first(r'(\d+)'))
        search_id = response.xpath('//input[@name="searchId"]/@value').get()
        pages = math.ceil(job_count / 20.0)
        # range() excludes its end point, so go up to pages + 1
        for page in range(2, pages + 1):
            url = 'https://jobs.et/jobs/?searchId={}&action=search&page={}'.format(search_id, page)
            yield Request(url, callback=self.parseNextPage)

    def parseNextPage(self, response):
        # parse the next and subsequent pages of jobs
        ...
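
parseNextPage would then pull the job links out of each page, much like your parse_link callback; a sketch, assuming job URLs keep the https://jobs.et/job/<id>/ shape from your LinkExtractor rule:

import re

def parseNextPage(self, response):
    # assumes job links match the /job/<id>/ pattern from the question's rule
    for href in response.xpath('//a/@href').getall():
        url = response.urljoin(href)   # resolve any relative links
        if re.match(r'https://jobs\.et/job/\d+/', url):
            yield {'url': url}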
