One spider with 2 different URLs and 2 parse callbacks using Scrapy


Question

Hi, I have two different domains that need two different approaches, and I want to run both in one spider. I have tried the code below, but it does not work. Any ideas?

import json

import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']
    url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']

    # Json Payload code here

    def start_requests(self):
        for i in self.url:
            if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
                print("sample: " + i)
                payload = self.payload.copy()
                payload['page']['pageNo'] = 1
                yield scrapy.Request(
                    i, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse_2, meta={'pageNo': 1})

            if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
                yield scrapy.Request(i, callback=self.parse_1)

    def parse_1(self, response):
        # Some code of getting item
        pass

    def parse_2(self, response):
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = GpdealsSpiderItem_f21()
            # item fields are populated here (elided in the question)
            yield item

        # simulate pagination if we are not at the end
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            payload = self.payload.copy()
            payload['page']['pageNo'] = response.meta['pageNo'] + 1
            yield scrapy.Request(
                # self.url is a list, so re-request the URL that produced this response
                response.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']}
            )

I always get this error:

NameError: name 'url' is not defined
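
The posted code does not show where a bare `url` is used, so the following is only a guess at the cause. In Python, a class attribute is not visible as a plain name inside a method; it has to be reached through `self` (or the class). A minimal sketch, where the `Demo` class and the example.com address are hypothetical placeholders and not part of the spider:

```python
class Demo:
    # Class attribute, analogous to `url` on the spider class.
    url = ["https://example.com/a"]  # placeholder address

    def broken(self):
        # Bare name: Python looks in local, global and builtin scopes,
        # not in the class body, so this raises NameError.
        return list(url)

    def fixed(self):
        # Attribute access through the instance finds the class attribute.
        return list(self.url)


if __name__ == "__main__":
    demo = Demo()
    print(demo.fixed())
    try:
        demo.broken()
    except NameError as exc:
        print("NameError:", exc)
```

If the real spider refers to `url` without `self.` anywhere inside a method, this is the error it would produce.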

Recommended answer

You have two different spiders in the same class. For the sake of maintainability, I recommend keeping them in separate files.

If you really want to keep them together, it is easier to split the URLs into two lists:

type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', ]
type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20', ]

def start_requests(self):
    for url in self.type1_urls:
        payload = self.payload.copy()
        yield scrapy.Request(
            # ...
            callback=self.parse_1
        )

    for url in self.type2_urls:
        yield scrapy.Request(url, callback=self.parse_2)
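
The routing above can be checked offline with a rough sketch that swaps `scrapy.Request` for a plain stand-in. Everything named here is an assumption made for illustration: `FakeRequest`, `build_page_request`, `next_page`, and the `PAYLOAD` contents (including the page size of 60) are hypothetical, because the real payload is elided in the question. Callbacks are recorded as strings instead of bound methods.

```python
import copy
import json
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, used only to inspect routing."""
    url: str
    callback: str
    method: str = "GET"
    body: str = ""
    meta: dict = field(default_factory=dict)


# Placeholder payload; the real one is not shown in the question.
PAYLOAD = {"page": {"pageNo": 1, "pageSize": 60}}

TYPE1_URLS = ["https://www.forever21.com/eu/shop/Catalog/GetProducts"]
TYPE2_URLS = ["https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20"]


def build_page_request(url, page_no):
    # Deep copy, because 'page' is a nested dict that a shallow copy would share.
    payload = copy.deepcopy(PAYLOAD)
    payload["page"]["pageNo"] = page_no
    return FakeRequest(url, callback="parse_1", method="POST",
                       body=json.dumps(payload), meta={"pageNo": page_no})


def start_requests():
    # JSON endpoint: POST with a payload, handled by parse_1.
    for url in TYPE1_URLS:
        yield build_page_request(url, 1)
    # HTML listing: plain GET, handled by parse_2.
    for url in TYPE2_URLS:
        yield FakeRequest(url, callback="parse_2")


def next_page(request_url, page_no, products):
    """Return the follow-up request, or None when the page was not full."""
    if len(products) == PAYLOAD["page"]["pageSize"]:
        return build_page_request(request_url, page_no + 1)
    return None


if __name__ == "__main__":
    for request in start_requests():
        print(request.method, request.callback, request.url)
```

In a real spider the pagination step would pass a single URL string, for example `response.url`, in place of `request_url`. Passing the whole list of URLs to `scrapy.Request` would not work.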
