Unable to rename downloaded images through pipelines without the usage of item.py

Problem description

I've created a script using Python's scrapy module to download and rename movie images from multiple pages of a torrent site and store them in a desktop folder. When it comes to downloading and storing those images in the desktop folder, my script works without errors. However, what I'm struggling to do now is rename those files on the fly. As I didn't make use of an item.py file, and I do not wish to either, I hardly understand how the logic in the pipelines.py file should handle the renaming process.

My spider (it downloads the images flawlessly):

from scrapy.crawler import CrawlerProcess
import scrapy, os

class YifySpider(scrapy.Spider):
    name = "yify"

    allowed_domains = ["www.yify-torrent.org"]
    start_urls = ["https://www.yify-torrent.org/search/1080p/p-{}/".format(page) for page in range(1,5)]

    custom_settings = {
        'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
        'IMAGES_STORE': r"C:\Users\WCS\Desktop\Images",
    }

    def parse(self, response):
        for link in response.css("article.img-item .poster-thumb::attr(src)").extract():
            img_link = response.urljoin(link)
            yield scrapy.Request(img_link, callback=self.get_images)

    def get_images(self, response):
        yield {
            'image_urls': [response.url],
        }

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',   
    })
    c.crawl(YifySpider)
    c.start()

pipelines.py contains (the following lines are placeholders to show that I at least tried):

from scrapy.http import Request

class YifyPipeline(object):

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]
        return image_name

    def get_media_requests(self, item, info):
        yield Request(item['image_urls'][0], meta=item)

How can I rename the images through pipelines.py without the usage of item.py?

Recommended answer

You need to subclass the original ImagesPipeline:

from scrapy.pipelines.images import ImagesPipeline

class YifyPipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None):
        image_name = request.url.split('/')[-1]
        return image_name

Then refer to it in your settings:

custom_settings = {
    'ITEM_PIPELINES': {'my_project.pipelines.YifyPipeline': 1},
}
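The path `my_project.pipelines.YifyPipeline` above presumes a regular Scrapy project layout. Since the spider in the question runs as a standalone script via CrawlerProcess, one possible variant, assuming the `YifyPipeline` class is defined in that same script file, is to reference it through the `__main__` module. This is a sketch under that assumption, not something stated in the original answer:

```python
# Settings for a single-file script: the pipeline class is assumed to live
# in the same file that is executed directly (so its module is "__main__").
custom_settings = {
    "ITEM_PIPELINES": {"__main__.YifyPipeline": 1},
    # Storage folder taken from the question's spider.
    "IMAGES_STORE": r"C:\Users\WCS\Desktop\Images",
}
```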

But be aware that a simple "use the exact filename" approach will cause problems when different files have the same name, because a later download will end up at the same path as an earlier one, unless you add a unique folder structure or an additional component to the filename. That's one reason checksum-based filenames (a SHA-1 hash of the image URL) are used by default. Refer to the original file_path in case you want to include some of the original logic to prevent that.
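One way to keep readable names while avoiding such collisions is to append a short hash of the image URL to the original filename. The following is a minimal sketch and not part of the original answer; the helper name `unique_image_name`, the 8-character hash length, and the `image` / `.jpg` fallbacks are assumptions chosen for illustration.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def unique_image_name(url: str) -> str:
    """Build '<original stem>-<short URL hash><extension>' from an image URL."""
    # Use only the URL path, so query strings and fragments are ignored.
    basename = posixpath.basename(urlparse(url).path)
    stem, ext = posixpath.splitext(basename)
    # Hash the full URL with SHA-1 and keep a short prefix, so two files
    # that share a name but come from different URLs get different paths.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    # Fallbacks (hypothetical defaults) for URLs without a usable filename.
    return f"{stem or 'image'}-{digest}{ext or '.jpg'}"
```

Inside `YifyPipeline.file_path`, returning `unique_image_name(request.url)` instead of the bare last URL segment would apply this naming scheme.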
