How to avoid re-downloading media to S3 in Scrapy?


Question

I previously asked a similar question (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definite answer I'll ask it again.

I've downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recent" is or how to set this parameter.
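
For context, this is roughly the kind of configuration involved; a minimal sketch, where the bucket name and credential values are placeholders rather than anything taken from the question:

# settings.py -- minimal sketch of a Files Pipeline that stores to S3
# (bucket name and credential values are placeholders)
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 's3://my-bucket/files/'   # the 's3' scheme selects S3FilesStore
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'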

Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it would appear that this is obtained from the FILES_EXPIRES setting, for which the default is 90 days:

class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading
    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.
    `new` files are those that pipeline never processed and needs to be
        downloaded from supplier site the first time.
    `uptodate` files are the ones that the pipeline processed and are still
        valid files.
    `expired` files are those that pipeline already processed but the last
        modification was made long time ago, so a reprocessing is recommended to
        refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
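
A quick sanity check of that arithmetic, with illustrative numbers (not from the source): a stored copy whose last_modified stat is 100 days old, measured against the default 90-day expiry, falls into the "force download" branch.

import time

last_modified = time.time() - 100 * 24 * 60 * 60          # stat says: modified 100 days ago
age_days = (time.time() - last_modified) / 60 / 60 / 24    # same computation as in the excerpt
print(age_days > 90)   # True -> _onsuccess returns None -> the file is downloaded again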

Do I understand this correctly? Also, I do not see a similar Boolean check on age_days in the S3FilesStore class; is the age check also implemented for files on S3? (I was also unable to find any tests covering this age-checking behaviour for S3.)

Answer

FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file may be before it is downloaded again.
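
So, for example, widening the window is just a settings change (120 days here is an illustrative value):

# settings.py
FILES_EXPIRES = 120   # only re-download files whose stored copy is older than 120 days

Because the constructor resolves the key through _key_for_pipe (see the excerpt above), a FilesPipeline subclass named, say, MyFilesPipeline can also be given its own MYFILESPIPELINE_FILES_EXPIRES value.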

The key section of the code is media_to_download: the _onsuccess callback inspects the result of the pipeline's self.store.stat_file call and, for your question, specifically its "last_modified" entry. If the last-modified time is older than the expiry window, the callback returns None, which forces the download.
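
If it helps to see that decision in isolation, here is a minimal sketch, assuming stat_file returns a dict with a Unix-timestamp "last_modified" entry as the excerpt suggests; should_redownload is a hypothetical helper, not part of Scrapy:

import time

EXPIRES_DAYS = 90  # default FILES_EXPIRES

def should_redownload(stat_result):
    """True when the stored copy is missing, has no timestamp, or is too old."""
    if not stat_result:
        return True                                   # never stored -> 'new'
    last_modified = stat_result.get('last_modified')
    if not last_modified:
        return True                                   # no timestamp -> treat as new
    age_days = (time.time() - last_modified) / 60 / 60 / 24
    return age_days > EXPIRES_DAYS                    # older than the window -> 'expired'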

You can check how S3FilesStore obtains the "last modified" information in its stat_file implementation; the exact code path depends on whether botocore is available.
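
As a standalone illustration of where that timestamp comes from on S3 (this is not the pipeline's own code, and the bucket/key names are placeholders), a HEAD request via boto3 returns the object's Last-Modified header without transferring the body:

import time
import boto3  # requires boto3/botocore and configured AWS credentials

def s3_age_in_days(bucket, key):
    """Illustrative: fetch an S3 object's Last-Modified via a HEAD request
    and convert it to an age in days, the quantity compared to FILES_EXPIRES."""
    head = boto3.client('s3').head_object(Bucket=bucket, Key=key)
    last_modified = time.mktime(head['LastModified'].timetuple())
    return (time.time() - last_modified) / 60 / 60 / 24

# e.g. s3_age_in_days('my-bucket', 'files/full/example.jpg')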
