Scrapy Images Downloading


Question

My spider runs without displaying any errors, but the images are not stored in the folder. Here are my Scrapy files:

Spider.py:

import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
    ]

    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
            yield scrapy.Request(urlparse.urljoin(response.url, img_url), callback=self.parseBasicListingInfo, meta={'item': item})

    def parseBasicListingInfo(item, response):
        item = response.request.meta['item']
        item = ListResidentialItem()
        try:
            image_urls = map(unicode.strip, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
            item['image_urls'] = [x for x in image_urls]
        except IndexError:
            item['image_urls'] = ''

        return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'

ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'

CONCURRENT_REQUESTS = 250

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}

items.py

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

    pass

My pipeline file is empty; I'm not sure what I am supposed to add to the pipeline.py file.

Any help is much appreciated.

Recommended answer

Since you don't know what to put in the pipelines file, I assume you can use the default image pipeline that Scrapy provides. In settings.py you can simply declare it like this:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

Also, your images path is wrong: a leading / points at the root of your filesystem. Either give the absolute path to where you want the images saved, or use a path relative to the directory you run your crawler from. Note that the setting is named IMAGES_STORE, not IMAGE_STORE as in your settings.py.

IMAGES_STORE = '/home/user/Documents/scrapy_project/images'  # absolute path

or

IMAGES_STORE = 'images'  # relative to where the crawler is run
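
To make the two settings points above concrete, here is a minimal sketch in plain Python 3 (no Scrapy import needed). The helper names check_image_settings and resolve_store are made up for illustration and are not part of Scrapy. It assumes a relative IMAGES_STORE is resolved against the directory the crawler is started from, and it flags the older scrapy.contrib path only because the declaration above replaces it.

```python
import os

NEW_PIPELINE = "scrapy.pipelines.images.ImagesPipeline"
OLD_PIPELINE = "scrapy.contrib.pipeline.images.ImagesPipeline"


def resolve_store(images_store, cwd=None):
    """Return the absolute directory that an IMAGES_STORE value points at."""
    base = cwd if cwd is not None else os.getcwd()
    # os.path.join ignores `base` when images_store is already absolute.
    return os.path.normpath(os.path.join(base, images_store))


def check_image_settings(settings):
    """Return likely problems in a plain dict of settings (hypothetical helper)."""
    problems = []
    pipelines = settings.get("ITEM_PIPELINES", {})
    if OLD_PIPELINE in pipelines:
        problems.append("ITEM_PIPELINES uses the older scrapy.contrib path")
    elif NEW_PIPELINE not in pipelines:
        problems.append("ImagesPipeline is not listed in ITEM_PIPELINES")
    if "IMAGES_STORE" not in settings:
        message = "IMAGES_STORE is not set"
        if "IMAGE_STORE" in settings:
            message += " (IMAGE_STORE found; the name needs the extra S)"
        problems.append(message)
    return problems


if __name__ == "__main__":
    # The values from the question's settings.py.
    question_settings = {
        "ITEM_PIPELINES": {OLD_PIPELINE: 300},
        "IMAGE_STORE": "/images",
    }
    for problem in check_image_settings(question_settings):
        print(problem)
    print(resolve_store("images"))
```

Running it against the settings from the question reports both the pipeline path and the misspelled store setting, and the last line prints where a relative 'images' store would end up for the current working directory.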

Now, in the spider you extract the URL but you don't save it into the item:

item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()  # image_urls must be a list, so use extract() rather than extract_first()

If you're using the default pipeline, the field has to be named image_urls exactly, and it has to hold a list of URLs.
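
As a rough sketch of what "a list of URLs" means in practice: the pipeline requests each entry of image_urls as-is, so relative or whitespace-padded attribute values should be cleaned up first. The helper below is hypothetical (build_image_urls is not a Scrapy function) and is written for Python 3, where urljoin lives in urllib.parse rather than the urlparse module used in the question.

```python
from urllib.parse import urljoin


def build_image_urls(page_url, hrefs):
    """Return a de-duplicated list of absolute URLs from raw attribute values."""
    urls = []
    for href in hrefs:
        href = href.strip()
        if not href:
            # Skip empty attribute values instead of requesting the page itself.
            continue
        absolute = urljoin(page_url, href)
        if absolute not in urls:
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    raw = [" /img/a.jpg ", "", "http://cdn.someurl.com/b.jpg", "/img/a.jpg"]
    print(build_image_urls("http://someurl.com/listing/1", raw))
```

In the callback this would be used along the lines of item['image_urls'] = build_image_urls(response.url, response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract()), which always yields a list, even when nothing matched.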

Now, in the items.py file you need the following two fields (both are required, with exactly these names):

image_urls = scrapy.Field()
images = scrapy.Field()
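
The second field, images, is filled in by the pipeline after a successful download. To know which file to look for when checking the folder, the sketch below approximates one entry of that list. It assumes the default naming I know from Scrapy's ImagesPipeline (a SHA-1 hash of the URL under a full/ subdirectory, saved as .jpg); treat that layout as an assumption about your Scrapy version. The function expected_image_entry is hypothetical, and the real entry carries extra keys such as a checksum that are omitted here.

```python
import hashlib


def expected_image_entry(image_url):
    """Approximate the dict the pipeline adds to item['images'] (assumed layout)."""
    # Assumption: the file name is the SHA-1 hex digest of the image URL.
    guid = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return {"url": image_url, "path": "full/" + guid + ".jpg"}


if __name__ == "__main__":
    entry = expected_image_entry("http://someurl.com/img/a.jpg")
    # The path is relative to IMAGES_STORE.
    print(entry["path"])
```

If no file shows up at that path under IMAGES_STORE after a crawl, the pipeline most likely never ran or never received any URLs.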

That should work.
