One spider with 2 different URLs and 2 parse callbacks using Scrapy

Views: 39  Published: 2021/7/16 22:25:45  Tags: python web-scraping scrapy


Question

Hi, I have 2 different domains that need 2 different approaches, and I want to run both in one spider. I have tried the code below, but it does not work. Any ideas, please?

class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']
    url = [
        'https://www.forever21.com/eu/shop/Catalog/GetProducts',
        'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20',
    ]

    # Json Payload code here

    def start_requests(self):
        for i in self.url:
            if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
                print("sample: " + i)
                payload = self.payload.copy()
                payload['page']['pageNo'] = 1
                yield scrapy.Request(
                    i, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse_2, meta={'pageNo': 1})

            if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
                yield scrapy.Request(i, callback=self.parse_1)

    def parse_1(self, response):
        # Some code of getting item

    def parse_2(self, response):
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = GpdealsSpiderItem_f21()
            # item yield

        yield item

        # simulate pagination if we are not at the end
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            payload = self.payload.copy()
            payload['page']['pageNo'] = response.meta['pageNo'] + 1
            yield scrapy.Request(
                self.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']}
            )

I always get this error:

NameError: name 'url' is not defined
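
The code shown only ever reads the list as `self.url`, so the failing line is presumably in a part that was not posted. One common way to get exactly this message is to use a class attribute by its bare name inside a method: Python does not search the class body when it resolves names inside a method, so the attribute has to be reached through `self` (or the class). A minimal sketch of that behaviour, using a hypothetical `Demo` class and placeholder URLs rather than the real spider:

```python
class Demo:
    # Class attribute, analogous to `url` on the spider (placeholder values).
    url = ['https://example.com/a', 'https://example.com/b']

    def broken(self):
        # A bare name is looked up in the local, global and builtin scopes,
        # never in the class body, so this line raises NameError.
        return list(url)

    def fixed(self):
        # Attribute access through the instance finds the class attribute.
        return list(self.url)


demo = Demo()

try:
    demo.broken()
except NameError as exc:
    message = str(exc)

urls = demo.fixed()
```

If this is the cause, replacing the bare `url` with `self.url` (or with the loop variable that holds the current URL) should make the error go away.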

Recommended answer

You effectively have two different spiders in the same class. For the sake of maintainability, I recommend keeping them in separate files.

If you really want to keep them together, it would be easier to split the URLs into two lists:

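import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    # Class attributes, so that self.type1_urls and self.type2_urls resolve.
    type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts']
    type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']

    def start_requests(self):
        # Type 1: JSON endpoint, requested with POST.
        # Note: parse_1 is the JSON callback here (it is called parse_2 in the question).
        for url in self.type1_urls:
            payload = self.payload.copy()
            yield scrapy.Request(
                # ...
                callback=self.parse_1
            )

        # Type 2: ordinary HTML page, requested with GET.
        for url in self.type2_urls:
            yield scrapy.Request(url, callback=self.parse_2)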
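
The routing and paging logic can be tried out without Scrapy installed. The sketch below is a stand-in, not Scrapy code: `payload_for_page`, `plan_start_requests` and `plan_next_page` are hypothetical helpers that return plain dicts where the spider would yield `scrapy.Request` objects. The payload shape is an assumption: only the `page`, `pageNo` and `pageSize` keys come from the question, and the page size of 60 is made up. Two details differ from the posted code. The payload is copied with `copy.deepcopy`, because `dict.copy()` is shallow and assigning to `payload['page']['pageNo']` would otherwise change the shared base payload. The follow-up request also reuses the single URL of the current response instead of passing the whole `url` list.

```python
import copy
import json

# Assumed payload shape; the value 60 is illustrative only.
BASE_PAYLOAD = {'page': {'pageNo': 1, 'pageSize': 60}}

TYPE1_URLS = ['https://www.forever21.com/eu/shop/Catalog/GetProducts']
TYPE2_URLS = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html'
              '?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']

JSON_HEADERS = {'X-Requested-With': 'XMLHttpRequest',
                'Content-Type': 'application/json; charset=UTF-8'}


def payload_for_page(page_no, base=BASE_PAYLOAD):
    """Return an independent payload for the given page number."""
    payload = copy.deepcopy(base)  # the nested 'page' dict is not shared
    payload['page']['pageNo'] = page_no
    return payload


def describe_post(url, page_no):
    """Describe the POST request for one page of the JSON endpoint."""
    return {'url': url, 'method': 'POST',
            'body': json.dumps(payload_for_page(page_no)),
            'headers': JSON_HEADERS,
            'callback': 'parse_1', 'meta': {'pageNo': page_no}}


def plan_start_requests():
    """Yield one request description per start URL, routed by URL type."""
    for url in TYPE1_URLS:
        yield describe_post(url, 1)
    for url in TYPE2_URLS:
        yield {'url': url, 'method': 'GET', 'callback': 'parse_2'}


def plan_next_page(url, data, page_no, base=BASE_PAYLOAD):
    """Return the follow-up request description, or None on the last page."""
    # A full page suggests there may be more results; a short page ends paging.
    if len(data['CatalogProducts']) != base['page']['pageSize']:
        return None
    return describe_post(url, page_no + 1)


plans = list(plan_start_requests())
```

In a real spider, each dict would become a `scrapy.Request(...)` with the same URL, method, body, headers and meta, `callback` would be the bound method (for example `self.parse_1`), and `url` in `plan_next_page` would be `response.url`.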
