Scrapy Images Downloading
Question
My spider runs without displaying any errors, but the images are not stored in the folder. Here are my Scrapy files:
Spider.py:
import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url),callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip,response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [ x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''
        return item
settings.py:
from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline
BOT_NAME = 'production'
SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'
CONCURRENT_REQUESTS = 250
DOWNLOAD_DELAY = 2
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}
items.py
# -*- coding: utf-8 -*-
import scrapy
class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass
My pipeline file is empty; I'm not sure what I am supposed to add to the pipeline.py file.
Any help is greatly appreciated.
Recommended answer
Since you don't know what to put in the pipelines file, I assume you can use the default images pipeline provided by Scrapy. In the settings.py file you can just declare it like this:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
Also, your images path is wrong. The leading / means you are pointing at the absolute root of your machine, so either give the absolute path to where you want to save the images, or use a path relative to where you are running your crawler. Note also that the setting is named IMAGES_STORE, not IMAGE_STORE as in your settings.py:
IMAGES_STORE = '/home/user/Documents/scrapy_project/images'
or
IMAGES_STORE = 'images'
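As a rough illustration of the two points above, here is a minimal sketch of a settings.py fragment together with a small check of where a relative store path ends up. The helper resolve_store is hypothetical (it is not part of Scrapy) and only shows how a relative path resolves against the directory the crawler is started from.

```python
import os

# Hypothetical settings.py fragment combining the two changes above.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "images"  # relative path: depends on where the crawl is started


def resolve_store(path):
    """Return the absolute filesystem location a store path refers to.

    Illustrative helper only; Scrapy does not expose a function by this name.
    """
    return os.path.abspath(path)


if __name__ == "__main__":
    # A relative path lands under the current working directory,
    # while "/images" points at the filesystem root.
    print(resolve_store(IMAGES_STORE))
    print(resolve_store("/images"))
```

Running the crawler from the project directory with the relative form should therefore put the files in an images folder next to scrapy.cfg, which is usually writable, unlike the filesystem root.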
Now, in the spider you extract the URL but you don't save it into the item:
item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()
The field has to literally be named image_urls if you're using the default pipeline, and its value has to be a list of URLs (extract() returns a list).
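Since the images pipeline requests each entry of image_urls as a URL, relative hrefs need to be made absolute first. Below is a minimal sketch of that step using only the standard library. It assumes Python 3 (urllib.parse rather than the Python 2 urlparse module used in the question); the helper name build_image_urls and the example URLs are made up for illustration, and a plain dict stands in for the Scrapy item.

```python
from urllib.parse import urljoin


def build_image_urls(page_url, hrefs):
    """Turn extracted href strings into a list of absolute URLs.

    Hypothetical helper: strips whitespace, skips empty values, and
    resolves each href against the URL of the page it came from.
    """
    return [urljoin(page_url, href.strip()) for href in hrefs if href.strip()]


item = {}  # a plain dict standing in for a Scrapy item
item["image_urls"] = build_image_urls(
    "http://someurl.com/listing/1",    # assumed page URL
    [" /photos/a.jpg ", "b.jpg", ""],  # assumed extracted hrefs
)
```

Inside a real callback, the hrefs list would come from the .extract() call and page_url from response.url.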
Now, in the items.py file you need to add the following two fields (both are required, with these literal names):
image_urls = scrapy.Field()
images = scrapy.Field()
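To help check whether anything was saved: with the default pipeline, the images field is filled in after download with information about each file, including its path relative to IMAGES_STORE. The sketch below mirrors what I understand the default naming scheme to be (a SHA-1 hash of the image URL under a full/ subfolder); treat that as an assumption to verify against your Scrapy version. The function name and the URL are hypothetical.

```python
import hashlib


def default_image_path(url):
    """Relative path under IMAGES_STORE, assuming the default naming scheme.

    Assumption: the file name is the SHA-1 hex digest of the image URL,
    stored as a .jpg inside a "full" subfolder.
    """
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(image_guid)


# Hypothetical URL, for illustration only.
path = default_image_path("http://someurl.com/photos/a.jpg")
```

If the crawl finishes and no full folder appears under the store directory, the log lines from the pipeline are the next place to look.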
That should work.