Scrapy file download: how to use a custom filename

Question

For my Scrapy project I'm currently using the FilesPipeline. The downloaded files are stored with the SHA1 hash of their URL as the file name:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]

How can I store the files under a custom file name instead?

In the example above, I would want the file name to be "product1_0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf", so that it stays unique but the original name remains visible.

As a starting point, I experimented with the pipelines.py of my project, without much success:

import scrapy
from scrapy.pipelines.files import FilesPipeline
from scrapy.exceptions import DropItem

class MyFilesPipeline(FilesPipeline):

    def file_path(self, request, response=None, info=None):
        # Use the name passed along in the request meta as the stored path
        return request.meta.get('filename','')

    def get_media_requests(self, item, info):
        file_url = item['file_url']
        # Carry the item's name with the request so file_path can read it
        meta = {'filename': item['name']}
        yield scrapy.Request(url=file_url, meta=meta)

I also included this in my settings.py:

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 300
    'io_spider.pipelines.MyFilesPipeline': 200
}

A similar question has been asked before, but it targets images rather than files.

Any help would be appreciated.

Recommended answer

file_path should return the path to your file. In your code, file_path returns request.meta['filename'], which is item['name'], so that value alone becomes the file's path. Note that the default file_path computes a SHA1 hash of the URL. To keep that hash and add your own name in front of it, your method should be something like this:

def file_path(self, request, response=None, info=None):
    # Let the parent class build the default 'full/<sha1><extension>' path
    original_path = super(MyFilesPipeline, self).file_path(request, response=response, info=info)
    sha1_and_extension = original_path.split('/')[1] # delete 'full/' from the path
    # Prefix the hash with the custom name carried in the request meta
    return request.meta.get('filename','') + "_" + sha1_and_extension
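
To see what kind of name the method above produces without running a crawl, here is a minimal standard-library sketch. It is not Scrapy's own code: the helper custom_file_name is a hypothetical name introduced only for illustration, and it assumes the default path is built from the SHA1 hash of the URL followed by the URL's extension, as described in the question.

```python
import hashlib
import os
from urllib.parse import urlparse


def custom_file_name(url: str, name: str) -> str:
    """Build '<name>_<sha1 of url><extension>' (hypothetical helper)."""
    # Assumption: hashing the URL mirrors what the default file_path does,
    # which keeps the resulting name unique per URL.
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path only, ignoring any query string.
    extension = os.path.splitext(urlparse(url).path)[1]
    return f"{name}_{url_hash}{extension}"


# Prints 'product1_' followed by a 40-character hex digest and '.pdf'
print(custom_file_name("http://www.example.com/files/product1.pdf", "product1"))
```

In the pipeline itself the hash and extension come from the parent class's file_path, so this helper is only useful for checking the naming scheme in isolation.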
