Scrapy image pipeline does not download images


Problem description

I'm trying to set up image downloading from web pages using the Scrapy framework and django-item. I think I have done everything as described in the docs, but after calling scrapy crawl my log looks like this:

[Crawl log]

I can't find any information there about what went wrong, but the Images field is empty and the directory does not contain any images.

This is my model:

from django.db import models


class Event(models.Model):
    title = models.CharField(max_length=100, blank=False)
    description = models.TextField(blank=True, null=True)
    event_location = models.CharField(max_length=100, blank=True, null=True)
    image_urls = models.CharField(max_length=200, blank=True, null=True)
    images = models.CharField(max_length=100, blank=True, null=True)
    url = models.URLField(max_length=200)

    def __unicode__(self):
        return self.title
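The spider below fills this model through an item loader; the matching item definition would look roughly like this. This is a minimal sketch, assuming the scrapy-djangoitem package is used (the model's import path is hypothetical):

# items.py -- a sketch, assuming scrapy-djangoitem is installed
from scrapy_djangoitem import DjangoItem

from MyScrapy.models import Event  # hypothetical import path for the model above


class EventItem(DjangoItem):
    # Item fields (title, image_urls, images, ...) are generated from the Event model
    django_model = Event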

and this is how I go from the spider to the image pipeline:

def parse_from_details_page(self, response):
    "Some code"
    item_event = item_loader.load_item()
    # this builds the image_urls list (there is always only one image URL)
    item_event['image_urls'] = [item_event['image_urls']]
    return item_event

and finally this is my settings.py for the Scrapy project:

import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.join(os.path.dirname((os.path.abspath(__file__))), 'MyScrapy')
#sys.path.insert(0, DJANGO_PROJECT_PATH)
#sys.path.append(DJANGO_PROJECT_PATH)
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "MyScrapy.settings")
#os.environ["DJANGO_SETTINGS_MODULE"] = "MyScrapy.settings"


django.setup()

BOT_NAME = 'EventScraper'

SPIDER_MODULES = ['EventScraper.spiders']
NEWSPIDER_MODULE = 'EventScraper.spiders'

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 100,
    'EventScraper.pipelines.EventscraperPipeline': 200,
}

#MEDIA STORAGE URL
IMAGES_STORE = os.path.join(DJANGO_PROJECT_PATH, "IMAGES")

#IMAGES (used to be sure that it takes good fields)
FILES_URLS_FIELD = 'image_urls'
FILES_RESULT_FIELD = 'images'

Thanks in advance for your help.

I used a custom image pipeline from the docs, looking like this:

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            import ipdb; ipdb.set_trace()
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        import ipdb; ipdb.set_trace()
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

In get_media_requests it creates a request to my URL, but in item_completed the results parameter contains something like this: [(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: >)]. I still don't know how to fix it. Is it possible that the problem is caused by the URL using https?
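The Failure object in that tuple wraps the actual exception, so one way to see what really went wrong is to log it from item_completed. A minimal debugging sketch, not a fix in itself:

import logging

from scrapy.pipelines.images import ImagesPipeline

logger = logging.getLogger(__name__)


class DebugImagesPipeline(ImagesPipeline):

    def item_completed(self, results, item, info):
        # Log the underlying exception for every failed download instead of
        # silently keeping only the successful ones.
        for ok, value in results:
            if not ok:
                # value is a twisted.python.failure.Failure wrapping the exception
                logger.warning("Image download failed: %s", value.getErrorMessage())
        return super(DebugImagesPipeline, self).item_completed(results, item, info)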

Answer

I faced the EXACT same issue with Scrapy. My solution:

I added headers to the request yielded in the get_media_requests function: a user agent and a host, along with some other headers. Here's my list of headers:

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Proxy-Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Host': 'images.finishline.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
}
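Wired into the pipeline, that looks roughly like this. A minimal sketch: the Host and User-Agent values are the example ones above and need to match the site the images are actually served from.

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):

    # Browser-like headers; reuse the full dict above and adjust Host / User-Agent
    # to the server that actually hosts the images.
    headers = {
        'Host': 'images.finishline.com',
        'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'),
    }

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            # Attach the headers so the image server sees a browser-like request
            yield scrapy.Request(image_url, headers=self.headers)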

Open the exact image URL in your browser (the URL you're downloading the image from) and check your browser's network tab for the list of request headers. Make sure the headers on the request I mentioned above are the same as those.

Hope it works.
