Scrapy: create a folder structure for downloaded images based on the URL from which they are downloaded


Problem description

I have an array of links that define the structure of a website. While downloading images from these links, I want to simultaneously place the downloaded images in a folder structure that mirrors the website structure, rather than just renaming them (as answered in Scrapy image download how to use custom filename).

My code looks like this:

import os
from urlparse import urlparse  # Python 2-era import, matching the original code

from scrapy.http import Request
from scrapy.contrib.pipeline.images import ImagesPipeline  # old-style pipeline exposing image_key()


class MyImagesPipeline(ImagesPipeline):
    """Custom image pipeline to rename images as they are being downloaded"""
    page_url=None
    def image_key(self, url):
        page_url=self.page_url
        image_guid = url.split('/')[-1]
        return '%s/%s/%s' % (page_url,image_guid.split('_')[0],image_guid)

    def get_media_requests(self, item, info):
        #http://store.abc.com/b/n/s/m
        os.system('mkdir '+item['sku'][0].encode('ascii','ignore'))
        self.page_url = urlparse(item['start_url']).path #I store the parent page's url in start_url Field
        for image_url in item['image_urls']:
            yield Request(image_url)

It creates the required folder structure, but when I go deeper into the folders, I see that the files have been misplaced.

I suspect this is happening because the get_media_requests and image_key functions might be executing asynchronously, so the value of page_url changes before it is used by image_key.

Recommended answer

You are absolutely right that asynchronous Item processing prevents using class variables via self within the pipeline. You will have to store your path in each Request and override a few more methods (untested):

def image_key(self, url, page_url):
    image_guid = url.split('/')[-1]
    return '%s/%s/%s' % (page_url, image_guid.split('_')[0], image_guid)

def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
        # carry the parent page's path with each request instead of on self
        yield Request(image_url, meta=dict(page_url=urlparse(item['start_url']).path))

def get_images(self, response, request, info):
    key = self.image_key(request.url, request.meta.get('page_url'))
    ...

def media_to_download(self, request, info):
    ...
    key = self.image_key(request.url, request.meta.get('page_url'))
    ...

def media_downloaded(self, response, request, info):
    ...
    try:
        key = self.image_key(request.url, request.meta.get('page_url'))
    ...
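If you are on a newer Scrapy release (roughly 1.0 onwards, where image_key() was replaced by file_path()), the same per-request idea can be expressed by overriding file_path(), which also receives the Request. The sketch below is only an adaptation of the approach above, not part of the original answer: the class name FolderStructureImagesPipeline and the page_path meta key are made up for illustration, while the start_url and image_urls fields come from the question.

from urllib.parse import urlparse

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class FolderStructureImagesPipeline(ImagesPipeline):
    """Store each image under <parent page path>/<name prefix>/<file name>."""

    def get_media_requests(self, item, info):
        # The parent page's path travels with every request in its meta,
        # so concurrently processed items cannot overwrite each other's state.
        page_path = urlparse(item['start_url']).path.strip('/')
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'page_path': page_path})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Build the path from the request itself, never from pipeline state.
        image_guid = request.url.split('/')[-1]
        return '%s/%s/%s' % (request.meta['page_path'],
                             image_guid.split('_')[0],
                             image_guid)

The returned path is interpreted relative to IMAGES_STORE, and Scrapy's filesystem store creates missing sub-directories on its own, so the os.system('mkdir ...') call from the question should not be needed here.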
