How to avoid re-downloading media to S3 in Scrapy?
Question
I previously asked a similar question (How does Scrapy avoid re-downloading media that was downloaded recently?), but since I did not receive a definite answer I'll ask it again.
I've downloaded a large number of files to an AWS S3 bucket using Scrapy's Files Pipeline. According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#downloading-and-processing-files-and-images), this pipeline avoids "re-downloading media that was downloaded recently", but it does not say how long ago "recently" is or how to set this parameter.
Looking at the implementation of the FilesPipeline class at https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py, it would appear that this is obtained from the FILES_EXPIRES setting, for which the default is 90 days:
class FilesPipeline(MediaPipeline):
    """Abstract pipeline that implement the file downloading

    This pipeline tries to minimize network transfers and file processing,
    doing stat of the files and determining if file is new, uptodate or
    expired.

    `new` files are those that pipeline never processed and needs to be
    downloaded from supplier site the first time.

    `uptodate` files are the ones that the pipeline processed and are still
    valid files.

    `expired` files are those that pipeline already processed but the last
    modification was made long time ago, so a reprocessing is recommended to
    refresh it in case of change.
    """

    MEDIA_NAME = "file"
    EXPIRES = 90
    STORE_SCHEMES = {
        '': FSFilesStore,
        'file': FSFilesStore,
        's3': S3FilesStore,
    }
    DEFAULT_FILES_URLS_FIELD = 'file_urls'
    DEFAULT_FILES_RESULT_FIELD = 'files'

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)

        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

    @classmethod
    def from_settings(cls, settings):
        s3store = cls.STORE_SCHEMES['s3']
        s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
        s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']
        s3store.POLICY = settings['FILES_STORE_S3_ACL']

        store_uri = settings['FILES_STORE']
        return cls(store_uri, settings=settings)

    def _get_store(self, uri):
        if os.path.isabs(uri):  # to support win32 paths like: C:\\some\dir
            scheme = 'file'
        else:
            scheme = urlparse(uri).scheme
        store_cls = self.STORE_SCHEMES[scheme]
        return store_cls(uri)

    def media_to_download(self, request, info):
        def _onsuccess(result):
            if not result:
                return  # returning None force download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None force download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None force download
Do I understand this correctly? Also, I do not see a similar Boolean statement with age_days in the S3FilesStore class; is the checking of age also implemented for files on S3? (I was also unable to find any tests of this age-checking feature for S3.)
Answer
FILES_EXPIRES is indeed the setting that tells the FilesPipeline how "old" a file may be before it is downloaded again.
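For reference, the window can be changed in a project's settings.py; a minimal sketch (the bucket path is a placeholder, not from the question):

```python
# settings.py (sketch; replace the bucket path with your own)
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 's3://my-bucket/files/'  # hypothetical bucket
FILES_EXPIRES = 365  # treat files younger than a year as up to date
```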
The key part of the code is in media_to_download: the _onsuccess callback checks the result of the pipeline's self.store.stat_file call and, for your question, it specifically looks at the "last_modified" info. If the last-modified time is older than "expires days", the download is triggered.
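That age test boils down to a simple computation on the last_modified timestamp. A self-contained sketch of the same logic (needs_redownload is an illustrative name, not a pipeline method):

```python
import time

def needs_redownload(last_modified, expires_days=90):
    """Mirror of the pipeline's check: True when the stored file has no
    stat info, no timestamp, or is older than the expiry window."""
    if not last_modified:
        return True  # no stat info -> force download
    age_days = (time.time() - last_modified) / 60 / 60 / 24
    return age_days > expires_days

# A file stored 100 days ago is older than the default 90-day window:
print(needs_redownload(time.time() - 100 * 24 * 60 * 60))  # True
print(needs_redownload(time.time() - 3600))                # False, still fresh
```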
You can check how the S3store gets the "last modified" information: it depends on whether botocore is available or not.
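When botocore is available, the store's stat call amounts to a HEAD request whose response carries a LastModified datetime. A sketch of that conversion, using a simulated response dict instead of a live S3 call (this is illustrative, not the pipeline's actual code):

```python
import time
from datetime import datetime, timezone

def age_days_from_s3_head(head):
    """Age in days of an S3 object, given a HEAD-style response dict
    carrying a 'LastModified' datetime (as botocore returns it)."""
    return (time.time() - head['LastModified'].timestamp()) / 86400

# Simulated response for an object stored 100 days ago:
fake_head = {
    'LastModified': datetime.fromtimestamp(time.time() - 100 * 86400,
                                           tz=timezone.utc),
}
print(age_days_from_s3_head(fake_head) > 90)  # True: older than the default window
```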