Renaming downloaded images in Scrapy 0.24 with content from an item field while avoiding filename conflicts?


Problem description

I'm attempting to rename the images that are downloaded by my Scrapy 0.24 spider. Right now the downloaded images are stored with a SHA1 hash of their URLs as the file names. I'd like to instead name them using the value I extract with item['model']. This question from 2011 outlines what I want, but the answers are for previous versions of Scrapy and don't work with the latest version.

Once I manage to get this working I'll also need to make sure I account for different images being downloaded with the same filename. So I'll need to download each image to its own uniquely named folder, presumably based on the original URL.
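
Roughly, this is the kind of pipeline I have in mind. To be clear, the snippet below is only an untested sketch of the idea, not working code: the class name, the 'item' meta key and the hard-coded .jpg extension are placeholders I picked for illustration. The item is attached to each image request via meta, the filename comes from item['model'], and every image goes into its own folder named after a SHA1 hash of its URL so identical filenames can't collide.

import hashlib

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request

class SketchImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Attach the whole item so file_path() can read item['model'] later.
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        model = item['model'][0]                          # extract() returns a list
        url_hash = hashlib.sha1(request.url).hexdigest()  # unique folder per image URL
        return 'full/%s/%s.jpg' % (url_hash, model)       # extension assumed for simplicity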

Here is a copy of the code I am using in my pipeline. I got this code from a more recent answer in the link above, but it's not working for me. Nothing errors out and the images are downloaded as normal, but my extra code doesn't seem to have any effect on the filenames, as they still appear as SHA1 hashes.

pipelines.py

class AllenheathPipeline(object):
    def process_item(self, item, spider):
        return item

import scrapy
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    #Name download version
    def file_path(self, request, response=None, info=None):
        item=request.meta['item'] # Like this you can use all from item, not just url.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % (image_guid)

    #Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        #yield Request(item['images']) # Adding meta. I don't know, how to put it in one line :-)
        for image in item['images']:
            yield Request(image)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

settings.py

BOT_NAME = 'allenheath'

SPIDER_MODULES = ['allenheath.spiders']
NEWSPIDER_MODULE = 'allenheath.spiders'

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}

IMAGES_STORE = 'c:/allenheath/images'

products.py (my spider)

import scrapy
import urlparse

from allenheath.items import ProductItem
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class productsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["http://www.allen-heath.com/"]
    start_urls = [
        "http://www.allen-heath.com/ahproducts/ilive-80/",
        "http://www.allen-heath.com/ahproducts/ilive-112/"
    ]

    def parse(self, response):
        for sel in response.xpath('/html'):
            item = ProductItem()
            item['model'] = sel.css('#prodsingleouter > div > div > h2::text').extract() # The value I'd like to use to name my images.
            item['shortdesc'] = sel.css('#prodsingleouter > div > div > h3::text').extract()
            item['desc'] = sel.css('#tab1 #productcontent').extract()
            item['series'] = sel.css('#pagestrip > div > div > a:nth-child(3)::text').extract()
            item['imageorig'] = sel.css('#prodsingleouter > div > div > h2::text').extract()
            item['image_urls'] = sel.css('#tab1 #productcontent .col-sm-9 img').xpath('./@src').extract()
            item['image_urls'] = [urlparse.urljoin(response.url, url) for url in item['image_urls']]
            yield item

items.py

import scrapy

class ProductItem(scrapy.Item):
    model = scrapy.Field()
    itemcode = scrapy.Field()
    shortdesc = scrapy.Field()
    desc = scrapy.Field()
    series = scrapy.Field()
    imageorig = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

Here's a pastebin of the output I get from the command prompt when I run the spider: http://pastebin.com/ir7YZFqf

Any help would be greatly appreciated!

Recommended answer

pipelines.py

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy import log

class MyImagesPipeline(ImagesPipeline):

    #Name download version
    def file_path(self, request, response=None, info=None):
        image_guid = request.meta['model'][0]
        log.msg(image_guid, level=log.DEBUG)
        return 'full/%s' % (image_guid)

    #Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + request.url.split('/')[-1]
        log.msg(image_guid, level=log.DEBUG)
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        yield Request(item['image_urls'][0], meta=item)
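
Note that get_media_requests here only passes the first image URL. If every image per item is needed, a variation along these lines (an untested sketch, meant to replace the method inside the same MyImagesPipeline) keeps passing the model with each request and still works with the file_path above, since request.meta['model'] stays the same:

    def get_media_requests(self, item, info):
        # Sketch: pass the model along with every image URL, not just the first one.
        # file_path() would then also need a URL-derived piece in the returned name
        # (e.g. request.url.split('/')[-1]) so the images of one product
        # don't overwrite each other.
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'model': item['model']})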

You're using the wrong pipeline in settings.py: ITEM_PIPELINES should point at your custom class instead of the stock ImagesPipeline. Use this:

ITEM_PIPELINES = {'allenheath.pipelines.MyImagesPipeline': 1}

For thumbnails to work, add this to settings.py:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (100, 100),
}
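
The keys in IMAGES_THUMBS ('small' and 'big') are what ends up as thumb_id in the thumb_path() override above, so the scaled copies should land under thumbs/small/ and thumbs/big/ inside IMAGES_STORE.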
