How to download (PDF) files with Python/Scrapy using the Files Pipeline?


Problem description


Using Python 3.7.2 on Windows 10, I'm struggling to get Scrapy v1.5.1 to download some PDF files. I followed the docs, but I seem to be missing something. Scrapy finds the desired PDF URLs but downloads nothing, and no errors are thrown either (at least).

The relevant code is:

scrapy.cfg:

[settings]
default = pranger.settings

[deploy]
project = pranger

settings.py:

BOT_NAME = 'pranger'

SPIDER_MODULES = ['pranger.spiders']
NEWSPIDER_MODULE = 'pranger.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = r'C:\pranger_downloaded'

FILES_URLS_FIELD = 'PDF_urls'
FILES_RESULT_FIELD = 'processed_PDFs'

pranger_spider.py:

import scrapy

class IndexSpider(scrapy.Spider):
    name = "index"
    url_liste = []

    def start_requests(self):
        urls = [
            'http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                'file_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }

items.py:

import scrapy    

class PrangerItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()


All other files are as they were created by the scrapy startproject command.
The output of scrapy crawl index is:

(pranger) C:\pranger>scrapy crawl index
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: pranger)
2019-02-20 15:45:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Feb 11 2019, 14:11:50) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-20 15:45:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pranger', 'NEWSPIDER_MODULE': 'pranger.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pranger.spiders']}
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-20 15:45:18 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline', 'pranger.pipelines.PrangerPipeline']
2019-02-20 15:45:18 [scrapy.core.engine] INFO: Spider opened
2019-02-20 15:45:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-20 15:45:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://verbraucherinfo.ua-bw.de/robots.txt> (referer: None)
2019-02-20 15:45:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3> (referer: None)
2019-02-20 15:45:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://verbraucherinfo.ua-bw.de/lmk.asp?ref=3>
{'file_urls': ['https://www.lrabb.de/site/LRA-BB-Desktop/get/params_E-428807985/3287025/Ergebnisse_amtlicher_Kontrollen_nach_LFGB_Landkreis_Boeblingen.pdf', <<...and dozens more URLs...>>], 'processed_PDFs': []}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-20 15:45:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 13268,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 20, 14, 45, 19, 166646),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 2, 20, 14, 45, 18, 864509)}
2019-02-20 15:45:19 [scrapy.core.engine] INFO: Spider closed (finished)


Oh BTW I published the code, just in case: https://github.com/R0byn/pranger/tree/5bfa0df92f21cecee18cc618e9a8e7ceea192403

Solution


The FILES_URLS_FIELD setting tells the pipeline which field of the item contains the URLs you want to download.


By default, this is file_urls, but if you change the setting, you also need to change the field name (key) you store the URLs in.


So you have two options - either use the default setting, or rename your item's field to PDF_urls as well.
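
For illustration, a minimal sketch of the first option, based on the settings.py shown in the question: keep the FilesPipeline enabled and FILES_STORE set, and simply drop the two custom field-name settings so the pipeline falls back to its defaults (file_urls for input, files for results), which is exactly the key the spider already yields:

ITEM_PIPELINES = {
    'pranger.pipelines.PrangerPipeline': 300,
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = r'C:\pranger_downloaded'

# Removed: FILES_URLS_FIELD = 'PDF_urls' and FILES_RESULT_FIELD = 'processed_PDFs'.
# Without them the pipeline reads item['file_urls'] and writes item['files'].

Alternatively (the second option), keep the settings as they are and rename the key the spider yields so it matches FILES_URLS_FIELD, e.g. in pranger_spider.py:

    def parse(self, response):
        for menupunkt in response.css('div#aufklappmenue'):
            yield {
                # key renamed from 'file_urls' to match FILES_URLS_FIELD = 'PDF_urls'
                'PDF_urls': menupunkt.css('div.aussen a.innen::attr(href)').getall()
            }

Since the spider yields plain dicts, nothing in items.py has to change for this sketch; if you switch to yielding PrangerItem objects instead, the item class would also need matching PDF_urls and processed_PDFs fields.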
