Scrapy - Scraping different web pages in one scrapy script

Question
I'm creating a web app that scrapes a long list of shoes from different websites. Here are my two separate Scrapy spiders:
http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3
from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com"]
    start_urls = ['http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3']

    def parse(self, response):
        shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        url = response.url
        name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
        price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
        price = price.replace('$', '')
        shoe_type = response.css('.exp-product-subtitle::text').extract_first()
        sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
        sizes = sizes.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract()
        sizes = [s.strip() for s in sizes]
        yield {
            'url': url,
            'name': name,
            'price': price,
            'sizes': sizes,
            'shoe_type': shoe_type
        }
http://www.dickssportinggoods.com/products/clearance-soccer-cleats.jsp
from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["dickssportinggoods.com"]
    start_urls = ['http://www.dickssportinggoods.com/products/clearance-soccer-cleats.jsp']

    def parse(self, response):
        shoes = response.xpath('//*[@class="fplpTitle header4"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        sizes = response.xpath('//*[@class="swatches clearfix"]/input/@value').extract()
        if sizes == []:
            pass
        url = response.url
        name = response.xpath('.//*[@id="PageHeading_3074457345618261107"]/h1/text()').extract_first()
        price = response.xpath('.//*[@itemprop="price"]/text()').extract_first()
        #shoe_type = response.css('.exp-product-subtitle::text').extract_first()
        yield {
            'url': url,
            'name': name,
            'price': price,
            'sizes': sizes,
            'shoe_type': ''
        }
How can I combine the two into a single spider? I already went through the Scrapy documentation and haven't seen this covered; it only explains how to scrape multiple addresses reached from one root address. Thanks
Answer

Put both domains in allowed_domains, put both URLs in start_urls, and then use a simple if-else on response.url to determine which part of the code to execute:
from scrapy import Spider
from scrapy.http import Request

class ShoesSpider(Spider):
    name = "shoes"
    allowed_domains = ["store.nike.com", "dickssportinggoods.com"]
    start_urls = ['http://store.nike.com/us/en_us/pw/mens-clearance-soccer-shoes/47Z7puZ896Zoi3',
                  'http://www.dickssportinggoods.com/products/clearance-soccer-cleats.jsp']

    def parse(self, response):
        # Pick the listing-page selector that matches the current site
        if "store.nike.com" in response.url:
            shoes = response.xpath('//*[@class="grid-item-image-wrapper sprite-sheet sprite-index-0"]/a/@href').extract()
        elif "dickssportinggoods.com" in response.url:
            shoes = response.xpath('//*[@class="fplpTitle header4"]/a/@href').extract()
        for shoe in shoes:
            yield Request(shoe, callback=self.parse_shoes)

    def parse_shoes(self, response):
        url = response.url
        if "store.nike.com" in response.url:
            name = response.xpath('//*[@itemprop="name"]/text()').extract_first()
            price = response.xpath('//*[@itemprop="price"]/text()').extract_first()
            price = price.replace('$', '')
            shoe_type = response.css('.exp-product-subtitle::text').extract_first()
            sizes = response.xpath('//*[@class="nsg-form--drop-down exp-pdp-size-dropdown exp-pdp-dropdown two-column-dropdown"]/option')
            sizes = sizes.xpath('text()[not(parent::option/@class="exp-pdp-size-not-in-stock selectBox-disabled")]').extract()
            sizes = [s.strip() for s in sizes]
            yield {
                'url': url,
                'name': name,
                'price': price,
                'sizes': sizes,
                'shoe_type': shoe_type
            }
        elif "dickssportinggoods.com" in response.url:
            sizes = response.xpath('//*[@class="swatches clearfix"]/input/@value').extract()
            if sizes == []:
                pass  # note: this check has no effect; sizes may simply be empty
            url = response.url
            name = response.xpath('.//*[@id="PageHeading_3074457345618261107"]/h1/text()').extract_first()
            price = response.xpath('.//*[@itemprop="price"]/text()').extract_first()
            #shoe_type = response.css('.exp-product-subtitle::text').extract_first()
            yield {
                'url': url,
                'name': name,
                'price': price,
                'sizes': sizes,
                'shoe_type': ''
            }
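One caveat with the substring check: `"store.nike.com" in response.url` also matches any URL that merely contains that text in its path or query string. If that is a concern, a more robust variant is to dispatch on the parsed hostname. The helper below is a hypothetical sketch, not part of the original answer:

```python
from urllib.parse import urlparse

def site_of(url):
    """Return the URL's hostname with any leading 'www.' stripped."""
    host = urlparse(url).netloc
    return host[4:] if host.startswith("www.") else host

# Inside parse(), compare exact hostnames instead of substrings:
#   if site_of(response.url) == "store.nike.com": ...
#   elif site_of(response.url) == "dickssportinggoods.com": ...
```

This way a page like `http://example.com/?ref=store.nike.com` would not be mistaken for a Nike page.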