Using Scrapy: how to download pdf files from some extracted links


Question

I've created code that extracts some links from a website (PDF links), and now I need to download these PDF files, but I am struggling with how to do that. This is the code:


    import scrapy


    class all5(scrapy.Spider):
        name = "all5"
        start_urls = [
            'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
        ]

        def parse(self, response):
            for link in response.css('.default .er').xpath('@href').extract():
                url = response.url
                path = response.css('ol.breadcrumb li a::text').extract()
                next_link = response.urljoin(link)
                yield scrapy.Request(next_link, callback=self.parse_det,
                                     meta={'url': url, 'path': path})

        def parse_det(self, response):
            def extract_with_css(query):
                return response.css(query).get(default='').strip()

            yield {
                'path': response.meta['path'],
                'finallink': extract_with_css('a.btn.btn-primary::attr(href)'),
                'url': response.meta['url'],
            }


The links that I need to download are the "finallink" values.

How can I solve this problem?

Answer

In the settings you have to activate the files pipeline

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

and set the folder for the downloaded files:

FILES_STORE = '.'

It will download files to FILES_STORE/full.
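
Put together, the two settings would look like this in a regular project's settings.py (a minimal sketch; the standalone script below passes the same keys to CrawlerProcess instead):

# settings.py - enable the built-in FilesPipeline and choose a storage folder
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '.'   # downloaded PDFs end up under ./full/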

And you have to use the name file_urls when you yield the data:

yield {
    'file_urls': [extract_with_css('a.btn.btn-primary::attr(href)')],
    # ... rest ...
}

It has to be a list even if you have only one file to download.

It should download the PDFs to files with unique names, which you will get back in the scraped data in the field files.
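
For illustration only (not part of the original answer), one scraped item then comes back with an extra files field roughly like this; the hash-based name and the checksum are placeholders reflecting FilesPipeline's default behaviour of deriving file names from the URL:

# a sketch of one item after FilesPipeline has processed it
{
    'path': ['...'],                                   # breadcrumb texts
    'url': 'https://www.alloschool.com/course/...',    # page the link came from
    'file_urls': ['https://.../some-lesson.pdf'],
    'files': [{
        'url': 'https://.../some-lesson.pdf',          # the requested URL
        'path': 'full/<sha1-of-url>.pdf',              # unique, hash-based name
        'checksum': '<md5-of-file-content>',
    }],
}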

Scrapy doc: Downloading and processing files and images
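
If you would rather keep the original file names than the hash-based ones, a common pattern (my addition, not part of the answer above; PdfNamePipeline is a hypothetical name, and the signature assumes a recent Scrapy version) is to subclass FilesPipeline and override file_path:

import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class PdfNamePipeline(FilesPipeline):
    # hypothetical subclass: keep the basename of the URL instead of the hash
    def file_path(self, request, response=None, info=None, *, item=None):
        # e.g. 'https://example.com/docs/lesson1.pdf' -> 'full/lesson1.pdf'
        return 'full/' + os.path.basename(urlparse(request.url).path)

You would then register this class in ITEM_PIPELINES in place of the default FilesPipeline; note that, unlike the hash names, original basenames are not guaranteed to be unique.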

Standalone code - you can copy it and run it without creating a project.

#!/usr/bin/env python3

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'

    start_urls = [
        'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
    ]

    def parse(self, response):
        for link in response.css('.default .er').xpath('@href').extract():
            url = response.url
            path = response.css('ol.breadcrumb li a::text').extract()
            next_link = response.urljoin(link)
            yield scrapy.Request(next_link, callback=self.parse_det,
                                 meta={'url': url, 'path': path})

    def parse_det(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'path': response.meta['path'],
            'file_urls': [extract_with_css('a.btn.btn-primary::attr(href)')],
            'url': response.meta['url'],
        }


c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save results in a file as CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv',

    # download files to `FILES_STORE/full`
    # it needs `yield {'file_urls': [url]}` in `parse_det()`
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    'FILES_STORE': '.',
})
c.crawl(MySpider)
c.start()
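
Save the script as, say, spider.py and run it directly with python3 spider.py (no scrapy crawl needed, since CrawlerProcess starts the crawl itself); the items are written to output.csv and the PDFs are saved under ./full/.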

