How to scrape a website with infinite scrolling?


Problem Description

I want to crawl this website. I have written a spider but it is only crawling the front page, i.e. the top 52 items.

I have tried this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from aqaq.items import aqaqItem
import os
import urlparse
import ast

a = []


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/",
    ]

    def parse(self, response):
        # ... Extract items in the page using extractors
        n = 3
        ct = 1

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="page"]')
        for site in sites:
            name = site.select('//div[@id="content"]/div[@class="l-pageWrapper"]/div[@class="l-main"]/div[@class="box box-bgcolor"]/section[@class="box-bd pan mtm"]/ul[@id="productsCatalog"]/li/a/@href').extract()
            print name
            print ct
            ct = ct + 1
            a.append(name)
        req = Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=" + str(n),
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, dont_filter=True)

        return req  # and your items

It is showing the following output:

2013-10-31 09:22:42-0500 [jabong] DEBUG: Crawled (200) <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> (referer: http://www.jabong.com/women/clothing/womens-tops/)
2013-10-31 09:22:42-0500 [jabong] DEBUG: Filtered duplicate request: <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-10-31 09:22:42-0500 [jabong] INFO: Closing spider (finished)
2013-10-31 09:22:42-0500 [jabong] INFO: Dumping Scrapy stats:

When I put dont_filter=True it will never stop.

Recommended Answer

Yes, dont_filter has to be used here since only the page GET parameter changes in the XHR request to http://www.jabong.com/women/clothing/womens-tops/?page=X each time you scroll the page down to the bottom.
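
If you want to confirm this behaviour before changing the spider, you can replay the XHR by hand. The snippet below is only a rough sketch for checking the pagination (it assumes the requests and lxml packages are installed and that the endpoint still answers plain GET requests the same way); it fetches a few page numbers and counts the product entries in each response:

import requests
from lxml import html

BASE = "http://www.jabong.com/women/clothing/womens-tops/"

for page in (1, 2, 3):
    # replay the infinite-scroll XHR: only the "page" query parameter changes
    resp = requests.get(BASE, params={"page": page},
                        headers={"Referer": BASE,
                                 "X-Requested-With": "XMLHttpRequest"})
    tree = html.fromstring(resp.content)
    print("page %d -> %d products" % (page, len(tree.xpath("//li[@data-url]"))))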

Now you need to figure out how to stop crawling. This is actually simple - just check when there are no products on the next page in the queue and raise a CloseSpider exception.

Here is a complete code example that works for me (stops at page number 234):

import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request


class Product(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()


class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/?page=1",
    ]
    page = 1

    def parse(self, response):
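        # each response is one "scroll" of the catalogue; an empty page means we are past the last one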
        products = response.xpath("//li[@data-url]")

        if not products:
            raise CloseSpider("No more products!")

        for product in products:
            item = Product()
            item['brand'] = product.xpath(".//span[contains(@class, 'qa-brandName')]/text()").extract()[0].strip()
            item['title'] = product.xpath(".//span[contains(@class, 'qa-brandTitle')]/text()").extract()[0].strip()
            yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/", "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, 
                      dont_filter=True)
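
If you save the spider above in a standalone file, you can run it without a full Scrapy project and export the scraped items to a feed (the file and output names below are just examples):

scrapy runspider jabong_spider.py -o womens_tops.json

It keeps requesting the next page until an empty one comes back and the CloseSpider exception shuts it down.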
