Extracting Images in Scrapy

Question

I've read through a few other answers here, but I'm missing something fundamental. I'm trying to extract the images from a website with a CrawlSpider.

settings.py

BOT_NAME = 'healthycomm'

SPIDER_MODULES = ['healthycomm.spiders']
NEWSPIDER_MODULE = 'healthycomm.spiders'

ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = '~/Desktop/scrapy_nsml/healthycomm/images'

items.py

class HealthycommItem(scrapy.Item):
    page_heading = scrapy.Field()
    page_title = scrapy.Field()
    page_link = scrapy.Field()
    page_content = scrapy.Field()
    page_content_block = scrapy.Field()

    image_url = scrapy.Field()
    image = scrapy.Field()

HealthycommSpider.py

class HealthycommSpiderSpider(CrawlSpider):
    name = "healthycomm_spider"
    allowed_domains = ["healthycommunity.org.au"]
    start_urls = (
        'http://www.healthycommunity.org.au/',
    )
    rules = (Rule(SgmlLinkExtractor(allow=()), callback="parse_items", follow=False), ) 


    def parse_items(self, response):
        content = Selector(response=response).xpath('//body')
        for nodes in content:

            img_urls = nodes.xpath('//img/@src').extract()

            item = HealthycommItem()
            item['page_heading'] = nodes.xpath("//title").extract()
            item["page_title"] = nodes.xpath("//h1/text()").extract()
            item["page_link"] = response.url
            item["page_content"] = nodes.xpath('//div[@class="CategoryDescription"]').extract()
            item['image_url'] = img_urls 
            item['image'] = ['http://www.healthycommunity.org.au' + img for img in img_urls]

            yield item

I'm not very familiar with Python in general, but I feel like I'm missing something very basic here.

Thanks, Jamie

Recommended answer

If you want to use the standard ImagesPipeline, you need to change your parse_items method to something like:

import urlparse
...

    def parse_items(self, response):
        content = Selector(response=response).xpath('//body')
        for nodes in content:

            # build absolute URLs
            img_urls = [urlparse.urljoin(response.url, src)
                        for src in nodes.xpath('//img/@src').extract()]

            item = HealthycommItem()
            item['page_heading'] = nodes.xpath("//title").extract()
            item["page_title"] = nodes.xpath("//h1/text()").extract()
            item["page_link"] = response.url
            item["page_content"] = nodes.xpath('//div[@class="CategoryDescription"]').extract()

            # use "image_urls" instead of "image_url"
            item['image_urls'] = img_urls 

            yield item
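
To see why the answer builds URLs with urljoin rather than prepending the site root, here is a small standalone sketch. It assumes Python 3, where urljoin lives in urllib.parse (the answer's `import urlparse` is the Python 2 spelling). The page URL and the src values are made up for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

# Hypothetical page URL and <img src> values, for illustration only.
page_url = "http://www.healthycommunity.org.au/about/index.html"
srcs = [
    "/images/logo.png",                   # root-relative
    "img/photo.jpg",                      # relative to the current page
    "http://cdn.example.com/banner.png",  # already absolute
]

# Approach from the question: prepend the site root to every src.
naive = ["http://www.healthycommunity.org.au" + src for src in srcs]

# Approach from the answer: resolve each src against the page URL.
joined = [urljoin(page_url, src) for src in srcs]

for bad, good in zip(naive, joined):
    print(bad, "->", good)
```

Only the root-relative src survives the naive concatenation. The other two turn into malformed URLs that the pipeline could not download, while urljoin resolves all three forms.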

Your item definition also needs "images" and "image_urls" fields (plural, not singular).

The other way is to set IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD to fit your item definition.
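
As a rough sketch of that second option, the two settings could be added to settings.py so the pipeline reads and writes the singular field names from the question's item. The mapping below is an assumption based on those field names: "image_url" would hold the URLs to download and "image" would receive the download results.

```python
# settings.py (fragment): a sketch, assuming the item keeps its singular field names

# Item field the ImagesPipeline reads the list of image URLs from
# (the pipeline's default is "image_urls").
IMAGES_URLS_FIELD = 'image_url'

# Item field the ImagesPipeline writes its download results to
# (the pipeline's default is "images").
IMAGES_RESULT_FIELD = 'image'
```

With this mapping the spider would still need to put absolute URLs into item['image_url'], and it should leave item['image'] unset so the pipeline can fill it in.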
