Downloading pictures with scrapy


Question

I'm just getting started with Scrapy and I've hit my first real problem: downloading pictures. This is my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from example.items import ProductItem
from scrapy.utils.response import get_base_url

import re

class ProductSpider(CrawlSpider):
    name = "product"
    allowed_domains = ["domain.com"]
    start_urls = [
            "http://www.domain.com/category/supplies/accessories.do"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select('//td[@class="thumbtext"]')
        number = 0
        for site in sites:
            item = ProductItem()
            xpath = '//div[@class="thumb"]/img/@src'
            item['image_urls'] = site.select(xpath).extract()[number]
            item['image_urls'] = 'http://www.domain.com' + item['image_urls']
            items.append(item)
            number = number + 1
        return items

When I comment out ITEM_PIPELINES and IMAGES_STORE in settings.py, I get the proper URL for the picture I want to download (I copy-pasted it into a browser to check).
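For context, the settings being toggled are presumably along these lines (a sketch only, using the old scrapy.contrib layout that matches the spider's imports; the IMAGES_STORE path is a placeholder):

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']  # enables the images pipeline
IMAGES_STORE = '/path/to/images'  # placeholder: directory where downloaded images are saved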

But when I uncomment them, I get the following error:

raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url:h

And I can't download my pictures.

I've searched for the whole day and didn't find anything helpful.

Answer

I think the image URL you scraped is relative. To construct the absolute URL, use urlparse.urljoin:

def parse(self, response):
    ...
    image_relative_url = hxs.select("...").extract()[0]
    import urlparse
    image_absolute_url = urlparse.urljoin(response.url, image_relative_url.strip())
    item['image_urls'] = [image_absolute_url]
    ...
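As a quick illustration of what urljoin does with a root-relative src (Python 2, matching the snippet above; the thumbnail path is made up):

import urlparse

page_url = 'http://www.domain.com/category/supplies/accessories.do'
relative_src = '/images/products/12345_thumb.jpg'  # hypothetical thumbnail path
print urlparse.urljoin(page_url, relative_src)
# -> http://www.domain.com/images/products/12345_thumb.jpg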


I haven't used ITEM_PIPELINES myself yet, but the docs say:

In a Spider, you scrape an item and put the URLs of its images into an image_urls field.

So item['image_urls'] should be a list of image URLs, but your code has:

item['image_urls'] = 'http://www.domain.com' + item['image_urls']

So I guess it ends up iterating over your single URL string character by character, using each character as a URL.
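Putting the two fixes together, the loop could look roughly like this (a sketch, not tested against the real site): the scraped src is made absolute with urljoin, and image_urls gets a list rather than a bare string, which is what the images pipeline expects.

import urlparse

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    for site in hxs.select('//td[@class="thumbtext"]'):
        item = ProductItem()
        # Select the <img> relative to this cell (.//) instead of the whole page,
        # so the manual `number` counter is no longer needed.
        src = site.select('.//div[@class="thumb"]/img/@src').extract()
        if src:
            # The images pipeline expects a *list* of absolute URLs; ProductItem
            # also needs to declare image_urls and images fields.
            item['image_urls'] = [urlparse.urljoin(response.url, src[0].strip())]
        items.append(item)
    return items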
