One spider with 2 different URLs and 2 parse methods using Scrapy
Question
I have two different domains that need two different approaches, and I want to run both in one spider. I tried the code below, but it does not work. Any ideas?
class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']
    url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']
    # JSON payload code here

    def start_requests(self):
        for i in self.url:
            if (i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts'):
                print("sample: " + i)
                payload = self.payload.copy()
                payload['page']['pageNo'] = 1
                yield scrapy.Request(
                    i, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse_2, meta={'pageNo': 1})
            if (i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20'):
                yield scrapy.Request(i, callback=self.parse_1)

    def parse_1(self, response):
        # Some code of getting item

    def parse_2(self, response):
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = GpdealsSpiderItem_f21()
            # item yield
            yield item
        # simulate pagination if we are not at the end
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            payload = self.payload.copy()
            payload['page']['pageNo'] = response.meta['pageNo'] + 1
            yield scrapy.Request(
                self.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']}
            )
I always get this error:
NameError: name 'url' is not defined
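A likely cause, assuming the failing line refers to `url` without the `self.` prefix: a name assigned in a class body becomes an attribute of the class and is not visible as a plain variable inside its methods. The sketch below uses a hypothetical `Demo` class with placeholder URLs, not the spider itself, to show the difference.

```python
class Demo:
    # Assigned in the class body, so it is an attribute of Demo.
    url = ['https://example.com/a', 'https://example.com/b']

    def broken(self):
        # A bare name is looked up in local, enclosing-function, global and
        # builtin scope; the class body is skipped, so this raises NameError.
        return url

    def working(self):
        # Class attributes are reached through the instance (or the class).
        return self.url


demo = Demo()

try:
    demo.broken()
    message = None
except NameError as exc:
    message = str(exc)

first_url = demo.working()[0]
```

If this is what happens in the spider, writing `self.url` wherever the list is needed removes the NameError.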
Recommended answer
You have two different spiders in the same class. For the sake of maintainability, I recommend keeping them in separate files.
If you really want to keep them together, it is easier to split the URLs into two lists:
# Class attributes of the spider
type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', ]
type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20', ]

def start_requests(self):
    # JSON endpoint: POST with a payload, handled by parse_2 in the question
    for url in self.type1_urls:
        payload = self.payload.copy()
        yield scrapy.Request(
            # ...
            callback=self.parse_2
        )
    # HTML listing page: plain GET, handled by parse_1 in the question
    for url in self.type2_urls:
        yield scrapy.Request(url, callback=self.parse_1)
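One detail to watch in `self.payload.copy()`: `dict.copy()` is shallow, so the nested `page` dict is still shared with the original, and assigning `payload['page']['pageNo']` changes `self.payload` as well. Below is a minimal sketch of a safer helper. It assumes a hypothetical template that contains only the `page` keys the question's code reads, since the real payload is not shown. The names `PAYLOAD_TEMPLATE`, `build_body` and `has_next_page` are illustrative.

```python
import copy
import json

# Hypothetical template: only the keys used in the question are included.
PAYLOAD_TEMPLATE = {'page': {'pageNo': 1, 'pageSize': 60}}


def build_body(page_no, template=PAYLOAD_TEMPLATE):
    """Return the JSON request body for one page without touching the template."""
    payload = copy.deepcopy(template)  # also copies the nested 'page' dict
    payload['page']['pageNo'] = page_no
    return json.dumps(payload)


def has_next_page(products, template=PAYLOAD_TEMPLATE):
    """A full page suggests more results; a short page is treated as the last."""
    return len(products) == template['page']['pageSize']


body_for_page_3 = build_body(3)
```

In the spider, the string returned by `build_body` would be passed as `body=` to `scrapy.Request`, and `has_next_page(data['CatalogProducts'])` would decide whether to request the following page.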