How to download Scrapy images into a dynamic folder?

Question

I am able to download images through Scrapy into the "full" folder, but I need to make the name of the destination folder dynamic, like full/session_id, every time Scrapy runs.

Is there any way to do this?

Recommended answer

I have not worked with the ImagesPipeline yet, but following the documentation, I'd override item_completed(results, item, info).

The original definition is:

def item_completed(self, results, item, info):
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

This should give you the result set for the downloaded images, including each file's path (it seems there can be many images on one item).

If you now change this method in a subclass so that it moves all files before setting the path, it should work as you want. You could set the target folder on your item in a field such as item['session_path']. You'd have to set this field on each item before returning/yielding your items from the spider.

The subclass with the overridden method could then look like this:

import os

# Import path used by older Scrapy releases; newer releases provide the
# same classes in scrapy.pipelines.images.
from scrapy.contrib.pipeline.images import ImagesPipeline, ImageException

class SessionImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        moved = []
        # iterate over the results of all successfully downloaded images
        for result in [x for ok, x in results if ok]:
            # note: result['path'] is relative to the IMAGES_STORE directory,
            # so prefix it with your store location if needed
            path = result['path']
            # here we create the session path where the files should be in the end
            # you'll have to change this path creation depending on your needs
            target_dir = item['session_path']
            target_path = os.path.join(target_dir, os.path.basename(path))

            # try to move the file and raise an exception if not possible
            # (os.rename returns None and signals failure by raising OSError)
            try:
                if not os.path.isdir(target_dir):
                    os.makedirs(target_dir)
                os.rename(path, target_path)
            except OSError:
                raise ImageException("Could not move image to target folder")

            result['path'] = target_path
            moved.append(result)

        # here we'll write out the results with the new paths,
        # if there is a result field on the item (just like the original code does)
        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = moved

        return item
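
The file-moving step can be tried on its own, outside Scrapy. The following is a minimal sketch using only the standard library; move_to_session is a hypothetical helper name (not part of Scrapy), and the file and folder names in the demo are made up for illustration.

```python
import os
import shutil
import tempfile


def move_to_session(src_path, session_dir):
    """Move src_path into session_dir and return the new path.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    # create the session folder if it does not exist yet
    os.makedirs(session_dir, exist_ok=True)
    target_path = os.path.join(session_dir, os.path.basename(src_path))
    # shutil.move raises an exception on failure rather than returning a status
    shutil.move(src_path, target_path)
    return target_path


if __name__ == "__main__":
    # build a throwaway "full/abc123.jpg" file to stand in for a downloaded image
    base = tempfile.mkdtemp()
    src = os.path.join(base, "full", "abc123.jpg")
    os.makedirs(os.path.dirname(src))
    with open(src, "wb") as fh:
        fh.write(b"fake image bytes")

    new_path = move_to_session(src, os.path.join(base, "full", "session_42"))
    print(new_path)
```

Inside item_completed, the same call would be made once per successful result, with the source path resolved against the image store directory.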

Even nicer would be to set the desired session path not on the item but in the configuration for your Scrapy run. For this, you would have to find out how to set configuration while the application is running, and I think you'd have to override the constructor.
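
One way to picture a per-run folder is to generate a session id once when the process starts and build every storage path from it. The sketch below is standard-library only and rests on assumptions: SESSION_ID and session_image_path are hypothetical names, and the full/<sha1 of URL>.jpg layout is meant to resemble Scrapy's default image naming rather than reproduce it exactly. In Scrapy releases whose ImagesPipeline exposes a file_path method, a subclass could return a path built this way, which would avoid moving files afterwards.

```python
import hashlib
import uuid

# Generated once per process, so every image from the same run shares a folder.
# Hypothetical name; in a real project this could come from a setting instead.
SESSION_ID = uuid.uuid4().hex


def session_image_path(url, session_id=SESSION_ID):
    """Return a relative storage path of the form full/<session_id>/<sha1>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}/{}.jpg".format(session_id, digest)


if __name__ == "__main__":
    # example URL is made up for illustration
    print(session_image_path("http://example.com/a.jpg"))
```

Passing session_id explicitly keeps the function easy to test, while the default ties all paths to the current run.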
