Scrapy - Scraping different web pages in one scrapy script


Problem description


I'm creating a web app that scrapes a long list of shoes from different websites. Here are my two individual scrapy scripts:

http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3

from scrapy import Spider
from scrapy.http import Request
class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3']
    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        url = response.url
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        price = price.replace('$','')
        shoe_type =  response.css('.exp-product-subtitle::text').extract_first()

        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
        sizes = sizes.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract()
        sizes = [s.strip() for s in sizes]
        yield {
            'url': url,
            'name' : name,
            'price' : price,
            'sizes' : sizes,
            'shoe_type': shoe_type
        }

http://www.dickssportinggoods.com/products/clearance-soccer-cleats.jsp

from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["dickssportinggoods.com"]
    start_urls = ['http://www.dickssportinggoods.com/products/clearance-soccer-cleats.jsp']

    def parse(self, response):
        shoes = response.xpath('//*[@class="fplpTitle header4"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        sizes = response.xpath('//*[@class="swatches clearfix"]/input/@value').extract()
        if not sizes:
            return  # no sizes in stock: skip this item ('pass' here was a no-op)
        url = response.url
        name = response.xpath('.//*[@id="PageHeading_3074457345618261107"]/h1/text()').extract_first()
        price = response.xpath('.//*[@itemprop="price"]/text()').extract_first()
        #shoe_type = response.css('.exp-product-subtitle::text').extract_first()
        yield {
            'url': url,
            'name': name,
            'price': price,
            'sizes': sizes,
            'shoe_type': ''
        }


How can I manage to put both of them together? I already went through the scrapy documentation and I haven't seen them mentioning this, it just mentions how to scrape two addresses from a root address. Thanks

Recommended answer


Put both domains in allowed_domains and both URLs in start_urls, then use a simple if-else on response.url to determine which part of the code to execute.

from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com", "dickssportinggoods.com"]
    start_urls = ['http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3',
                  'http://www.dickssportinggoods.com/products/clearance-soccer-cleats.jsp']

    def parse(self, response):
        if "store.nike.com" in response.url:
            shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        elif "dickssportinggoods.com" in response.url:
            shoes = response.xpath('//*[@class="fplpTitle header4"]/a/@href').extract()

        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        url = response.url

        if "store.nike.com" in response.url:
            name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
            price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
            price = price.replace('$', '')
            shoe_type = response.css('.exp-product-subtitle::text').extract_first()

            sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
            sizes = sizes.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract()
            sizes = [s.strip() for s in sizes]
            yield {
                'url': url,
                'name': name,
                'price': price,
                'sizes': sizes,
                'shoe_type': shoe_type
            }
        elif "dickssportinggoods.com" in response.url:
            sizes = response.xpath('//*[@class="swatches clearfix"]/input/@value').extract()
            if not sizes:
                return  # no sizes in stock: skip this item ('pass' here was a no-op)
            name = response.xpath('.//*[@id="PageHeading_3074457345618261107"]/h1/text()').extract_first()
            price = response.xpath('.//*[@itemprop="price"]/text()').extract_first()
            yield {
                'url': url,
                'name': name,
                'price': price,
                'sizes': sizes,
                'shoe_type': ''
            }
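As the spider grows to more sites, the if-elif chains get unwieldy. A variant of the same idea (a sketch, not part of the original answer) keeps one parser routine per site in a dict keyed by domain and dispatches on the hostname with `urllib.parse`; the dummy `parse_nike`/`parse_dicks` functions and the `PARSERS` registry below are illustrative names, not Scrapy API:

```python
from urllib.parse import urlparse

# Stand-ins for the real per-site parse_shoes logic (illustrative only).
def parse_nike(url):
    return {'site': 'nike', 'url': url}

def parse_dicks(url):
    return {'site': 'dicks', 'url': url}

# Registry mapping a domain suffix to its parser.
PARSERS = {
    'store.nike.com': parse_nike,
    'dickssportinggoods.com': parse_dicks,
}

def dispatch(url):
    """Pick the right parser by matching the URL's hostname against the registry."""
    host = urlparse(url).netloc
    for domain, parser in PARSERS.items():
        # Match the domain exactly or as a suffix (covers 'www.' prefixes).
        if host == domain or host.endswith('.' + domain):
            return parser(url)
    raise ValueError('no parser registered for %s' % host)
```

Inside a Scrapy callback the same lookup would select which extraction branch to run on `response.url`; adding a new site then means registering one more entry instead of extending every if-elif chain.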
