Understanding how renaming images in Scrapy works

Problem description

I have read all the related questions here, but I still don't understand.

With the code below I can do everything I need except rename the images, so I tried to change the name in the items.py file. Please check the comments inside.

settings.py

SPIDER_MODULES = ['xxx.spiders']
NEWSPIDER_MODULE = 'xxx.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/home/magicnt/xxx/images'

items.py

import scrapy


class XxxItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    image_urls = scrapy.Field()
    # images = scrapy.Field()  # <--- with this line it works, using the default image names
    images = title  # <--- I tried to rename here, but it does not work

spider.py

from xxx.items import XxxItem
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class CoverSpider(scrapy.Spider):
    name = "pyimagesearch-cover-spider"
    start_urls = ['https://xxx.com.br/product']
    def parse(self, response):
        for bimb in response.css('#mod_imoveis_result'):
            imageURL = bimb.xpath('./div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src').extract_first()
            title = bimb.css('#titulo_imovel::text').extract_first()
            yield {
                'image_urls' : [response.urljoin(imageURL)],
                'title' : title
            }
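        next_page = response.xpath('//a[contains(@class, "num_pages") and contains(@class, "pg_number_next")]/@href').extract_first()
        # Only follow the link when a next page exists; on the last page extract_first() returns None
        if next_page:
            yield response.follow(next_page, self.parse)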

My goal is to rename the downloaded images using the title from the item. Any tips toward this goal are welcome.

I'm totally new to Python and OO. I usually scrape with procedural PHP, but I have realized how good Scrapy can be, so I ask for a little patience and help.

Recommended answer

My code is based on "Scrapy Image Pipeline: How to rename images?". I tested it a week ago and it works on my own spiders.

import scrapy
from scrapy.pipelines.images import ImagesPipeline


# This pipeline is designed for an item with multiple images
class ImagesWithNamesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Values in the field "image_names" must carry the suffix ".jpg".
        # You can replace "image_names" with your own image-name field (e.g. "images"),
        # but it must be a list with one name per URL.
        for (image_url, image_name) in zip(item[self.IMAGES_URLS_FIELD], item["image_names"]):
            yield scrapy.Request(url=image_url, meta={"image_name": image_name})

    def file_path(self, request, response=None, info=None, *, item=None):
        # "item" is accepted only because newer Scrapy versions pass it; it is unused here
        image_name = request.meta["image_name"]
        return image_name
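For this pipeline to run, it presumably has to take the place of the stock ImagesPipeline in settings.py. A minimal sketch, assuming the class is saved in xxx/pipelines.py (that module path is an assumption, it is not stated in the question):

```python
# settings.py (sketch): the module path "xxx.pipelines" is an assumption;
# point it at wherever ImagesWithNamesPipeline is actually defined.
ITEM_PIPELINES = {'xxx.pipelines.ImagesWithNamesPipeline': 1}
IMAGES_STORE = '/home/magicnt/xxx/images'  # unchanged from the question
```

The spider would then also need to yield an "image_names" list parallel to "image_urls", for example [title + ".jpg"].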


Here is how ImagesPipeline works:

The pipeline executes image_downloaded -> get_images -> file_path, in that order ("->" means "invokes").

  • image_downloaded: saves the images that get_images returns by invoking persist_file
  • get_images: converts images to JPEG
  • file_path: returns the relative path of the image
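The call chain above can be pictured with a toy model. This is a minimal sketch, not Scrapy's real code: the classes ToyImagesPipeline and RenamingToyPipeline, the simplified method signatures and the in-memory "stored" dict are all made up for illustration.

```python
class ToyImagesPipeline:
    """Toy stand-in for the call chain; NOT Scrapy's real implementation."""

    def __init__(self):
        self.stored = {}  # relative path -> image bytes (stands in for the disk)

    def file_path(self, url):
        # Decides the relative path; placeholder for the hash-based default.
        return 'full/default-name.jpg'

    def get_images(self, url, raw_bytes):
        # The real pipeline converts to JPEG here; the toy passes bytes through.
        yield self.file_path(url), raw_bytes

    def persist_file(self, path, data):
        # Saves the data under the path that file_path chose.
        self.stored[path] = data

    def image_downloaded(self, url, raw_bytes):
        for path, data in self.get_images(url, raw_bytes):
            self.persist_file(path, data)


class RenamingToyPipeline(ToyImagesPipeline):
    def file_path(self, url):
        # Overriding only this method changes where the image ends up.
        return 'my-title.jpg'


pipeline = RenamingToyPipeline()
pipeline.image_downloaded('https://example.com/a.jpg', b'fake image bytes')
print(sorted(pipeline.stored))  # ['my-title.jpg']
```

Because file_path sits at the end of the chain, the subclass does not need to touch downloading, conversion or saving.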

I scanned through the source code of ImagesPipeline and found no special field for renaming an image. By default, Scrapy names it this way:

def file_path(self, request, response=None, info=None):
    # Default naming (simplified excerpt of the Scrapy source): SHA-1 hash of the request URL
    image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return 'full/%s.jpg' % (image_guid)
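To see what this default produces outside of Scrapy, here is a small standalone sketch. The function name default_image_path and the example URL are hypothetical, and str.encode stands in for Scrapy's to_bytes helper:

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming shown above: 'full/<sha1 of the URL>.jpg'."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


path = default_image_path('https://example.com/photo.png')
print(path)  # 'full/' followed by 40 hex characters and '.jpg'
```

The name depends only on the URL, never on any item field, which is why assigning images = title in items.py has no effect on the file name.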

Therefore we should override the file_path method. According to the source code of FilesPipeline, which ImagesPipeline inherits from, we only need to return a relative path and persist_file will take care of the rest.
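Since the returned path becomes a file name on disk, a raw title may need cleaning first. A minimal sketch of one way to do that; the helper title_to_image_name, its fallback value and the sample title are hypothetical, not part of Scrapy or of the question:

```python
import re


def title_to_image_name(title, fallback='untitled'):
    """Turn an item title into a file name usable as a relative path."""
    # Replace every run of characters that is not a letter, digit,
    # underscore or dash with a single "_", then trim stray underscores.
    cleaned = re.sub(r'[^\w\-]+', '_', (title or '').strip()).strip('_')
    return '%s.jpg' % (cleaned or fallback)


print(title_to_image_name('  Casa 3 quartos / Centro '))  # Casa_3_quartos_Centro.jpg
```

In the spider, the item could then carry 'image_names': [title_to_image_name(title)], which the ImagesWithNamesPipeline above hands to file_path through request.meta. Note that two items with the same title would overwrite each other's image under this scheme.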
