Scrapy: customize the image pipeline by renaming the default image name


Problem description

I am using the image pipeline to download all the images from different websites.

All the images are successfully downloaded to my defined folder, but I am unable to give each downloaded image a name of my choice before it is saved to disk.
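(By default, Scrapy's ImagesPipeline names each saved file after the SHA1 hash of its URL, which is why a custom pipeline is needed to store images under readable names.)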

Here is my code:

class jellyImagesPipeline(ImagesPipeline):

    def image_key(self, url, item):
        name = item['image_name']
        return 'full/%s.jpg' % (name)

    def get_media_requests(self, item, info):
        print 'Entered get_media_request'
        for image_url in item['image_urls']:
            yield Request(image_url)


Image_spider.py

def getImage(self, response):
    item = JellyfishItem()
    item['image_urls'] = [response.url]
    item['image_name'] = response.meta['image_name']
    return item

What changes do I need to make in my code?

Update 1

pipelines.py

class jellyImagesPipeline(ImagesPipeline):

    def image_custom_key(self, response):
        print '\n\n image_custom_key \n\n'
        name = response.meta['image_name'][0]
        img_key = 'full/%s.jpg' % (name)
        print "custom image key:", img_key
        return img_key

    def get_images(self, response, request, info):
        print "\n\n get_images \n\n"
        for key, image, buf in super(jellyImagesPipeline, self).get_images(response, request, info):
            yield key, image, buf

        key = self.image_custom_key(response)
        orig_image = Image.open(StringIO(response.body))
        image, buf = self.convert_image(orig_image)
        yield key, image, buf

    def get_media_requests(self, item, info):
        print "\n\nget_media_requests\n"
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

Update 2

def image_key(self, image_name):
    print 'entered into image_key'
    name = 'homeshop/%s.jpg' % (image_name)
    print name
    return name

def get_images(self, request):
    print '\nEntered into get_images'
    key = self.image_key(request.url)
    yield key

def get_media_requests(self, item, info):
    print '\n\nEntered media_request'
    print item['image_name']
    yield Request(item['image_urls'][0], meta=dict(image_name=item['image_name']))

def item_completed(self, results, item, info):
    print '\n\nentered into item_completed\n'
    print 'Name : ', item['image_urls']
    print item['image_name']
    for tuple in results:
        print tuple

Recommended answer

pipelines.py

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request
from PIL import Image
from cStringIO import StringIO
import re

class jellyImagesPipeline(ImagesPipeline):

    CONVERTED_ORIGINAL = re.compile(r'^full/[0-9a-f]+\.jpg$')

    # name information coming from the spider, in each item
    # add this information to Requests() for individual images downloads
    # through "meta" dictionary
    def get_media_requests(self, item, info):
        print "get_media_requests"
        return [Request(x, meta={'image_name': item["image_name"]})
                for x in item.get('image_urls', [])]

    # this is where the image is extracted from the HTTP response
    def get_images(self, response, request, info):
        print "get_images"

        for key, image, buf in super(jellyImagesPipeline, self).get_images(response, request, info):
            if self.CONVERTED_ORIGINAL.match(key):
                key = self.change_filename(key, response)
            yield key, image, buf

    def change_filename(self, key, response):
        return "full/%s.jpg" % response.meta['image_name'][0]

In settings.py, make sure you have:

ITEM_PIPELINES = ['jelly.pipelines.jellyImagesPipeline']
IMAGES_STORE = '/path/to/where/you/want/to/store/images'

Example spider: get images from Python.org's homepage. The names (and paths) of the saved images will follow the site structure, i.e. they end up in a folder called www.python.org.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import urlparse

class CustomItem(Item):
    image_urls = Field()
    image_name = Field()
    images = Field()

class ImageSpider(BaseSpider):
    name = "customimg"
    allowed_domains = ["www.python.org"]
    start_urls = ['http://www.python.org']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//img')
        items = []
        for site in sites:
            item = CustomItem()
            item['image_urls'] = [urlparse.urljoin(response.url, u) for u in site.select('@src').extract()]
            # the name information for your image
            item['image_name'] = ['whatever_you_want']
            items.append(item)
        return items
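With the pipeline enabled, running the spider (scrapy crawl customimg) saves each downloaded image under IMAGES_STORE using the name carried in the item's image_name field rather than the default SHA1 hash of its URL.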
