Bulk download of PDFs with Scrapy and Python 3


Problem description

I would like to bulk-download the freely downloadable PDFs (copies of an old newspaper called Gaceta, published from 1843 to 1900) from this website of the Nicaraguan National Assembly, using Python 3 and Scrapy.

I am an absolute beginner in programming and Python, but I tried to start with an (unfinished) script:

#!/usr/bin/env python3

import scrapy
from scrapy.http import Request


class gaceta(scrapy.Spider):
    name = "gaceta"

    allowed_domains = ["digesto.asamblea.gob.ni"]
    start_urls = ["http://digesto.asamblea.gob.ni/consultas/coleccion/"]

    def parse(self, response):
        # First attempt at collecting the link of each issue from the results
        # table; this selector does not work yet (see the problems described below).
        for href in response.css('div#gridTableDocCollection::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Intended to follow the actual PDF links on an issue page.
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf  # save_pdf is not defined yet
            )

Each link to an issue contains some gibberish, so the links cannot be predicted and have to be scraped from the page source. See, for example, the links to the first four available issues of the newspaper (not every day had an issue); a minimal download sketch follows them:

#06/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D

#13/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=3sAxsKCA6Bo%3D

#28/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=137YSPeIXg8%3D

#08/08/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=aTvB%2BZpqoMw%3D
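
For illustration, here is a minimal sketch of the missing save_pdf callback the script above refers to: it simply writes the body of one of these pdf.php responses to disk. The filename scheme (derived from the rdd query parameter) is my own assumption, not something the site prescribes.

from urllib.parse import urlparse, parse_qs

def save_pdf(self, response):
    # Name the file after the "rdd" query parameter of the pdf.php URL
    # (illustrative naming only), then write the raw PDF bytes to disk.
    rdd = parse_qs(urlparse(response.url).query).get("rdd", ["unknown"])[0]
    filename = "gaceta-{}.pdf".format(rdd.rstrip("=").replace("/", "_"))
    with open(filename, "wb") as f:
        f.write(response.body)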

My problem is that I cannot put a working script together.

I would like my script to:

a) find each PDF link within the table that appears after a search (called "tableDocCollection" in the website's source code). The actual link sits behind the "Acciones" button (XPath of the first issue: //*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[1]/a);

b) display the name of the issue it is downloading, which can also be found behind the "Acciones" button (XPath of the name to be displayed for the first issue: //*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[2]/a); a small extraction sketch follows below.
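
As a small illustration of b): once the results table is actually present in the response (it is injected by JavaScript on the live page, so this does not work against the initial page source), the display name could be pulled with the XPath quoted above. parse_search_results is a hypothetical callback name.

def parse_search_results(self, response):
    # Log the display name of the first issue using the XPath above;
    # extract_first() returns None if the table is not in the response.
    titulo = response.xpath(
        '//*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[2]/a/text()'
    ).extract_first()
    self.logger.info("Downloading issue: %s", titulo)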

The major problems I run into when writing the script are:

1) the URL of the website does not change when I type in a search. So it seems that I have to tell Scrapy to submit the appropriate search terms (check the "Búsqueda avanzada" box, "Colección: Diario Oficial", "Medio de Publicación: La Gaceta", time interval 06/07/1843 to 31/12/1900)? A sketch of this approach follows point 2 below.

2) I do not know how each PDF link can be found.
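
Regarding 1): because the page loads its results through an AJAX POST rather than by changing the URL, one way forward is to reproduce that request directly. The endpoint and form fields below are the ones identified in the accepted answer further down ("Diario Oficial" has collection id 28); start_search is a hypothetical spider method name and parse_rdds refers to the callback defined in that answer.

from scrapy import FormRequest

def start_search(self):
    # Reproduce the AJAX search request instead of driving the HTML form.
    yield FormRequest(
        url="http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php",
        headers={"X-Requested-With": "XMLHttpRequest"},
        formdata={"hddQueryType": "initgetRdds", "cole": "28"},
        callback=self.parse_rdds,
    )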

How can I update the above script so that I can download all PDFs in the range 06/07/1843 to 31/12/1900?

EDIT:

#!/usr/bin/env python3
import scrapy
from scrapy.http import FormRequest

# Attempt (run in the Scrapy shell) to replay the request behind the search;
# the "rdds" structure was copied from the JSON the page exchanges.
frmdata = {"rdds": [{"rddid": "+1RiQw3IehE=", "anio": "", "fecPublica": "",
                     "numPublica": "", "titulo": "", "paginicia": None,
                     "norma": None, "totalRegistros": "10"}]}
url = "http://digesto.asamblea.gob.ni/consultas/coleccion/"
r = FormRequest(url, formdata=frmdata)
fetch(r)  # fetch() only exists inside the Scrapy shell

# inside a spider this would instead be:
# yield FormRequest(url, callback=self.parse, formdata=frmdata)


Recommended answer

# -*- coding: utf-8 -*-
import errno
import json
import os

import scrapy
from scrapy import FormRequest, Request


class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'
    allowed_domains = ['asamblea.gob.ni']
    start_urls = ['http://digesto.asamblea.gob.ni/consultas/coleccion/']

    # Collection name -> collection id used by the site's search backend.
    # Uncomment the other entries to download further collections.
    papers = {
    #    "Diario de Circulación Nacional" : "176",
        "Diario Oficial": "28",
    #    "Obra Bibliográfica": "31",
    #    "Otro": "177",
    #    "Texto de Instrumentos Internacionales": "103"
    }

    def parse(self, response):
        # Replay the AJAX request the search page makes for each collection.
        for key, value in list(self.papers.items()):
            yield FormRequest(
                url='http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php',
                headers={
                    'X-Requested-With': 'XMLHttpRequest'
                },
                formdata={
                    'hddQueryType': 'initgetRdds',
                    'cole': value
                },
                meta={'paper': key},
                callback=self.parse_rdds
            )

    def parse_rdds(self, response):
        # The endpoint returns JSON; each "rdd" entry identifies one issue.
        data = json.loads(response.body_as_unicode())
        for r in data["rdds"]:
            r['paper'] = response.meta['paper']
            rddid = r['rddid']
            yield Request("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=" + rddid,
                          callback=self.download_pdf, meta=r)

    def download_pdf(self, response):
        # Store each PDF under <paper>/<year>/<title>-<date>.pdf.
        filename = ("{paper}/{anio}/".format(**response.meta)
                    + "{titulo}-{fecPublica}.pdf".format(**response.meta).replace("/", "_"))
        if not os.path.exists(os.path.dirname(filename)):
            try:
                os.makedirs(os.path.dirname(filename))
            except OSError as exc:  # guard against a race condition
                if exc.errno != errno.EEXIST:
                    raise

        with open(filename, 'wb') as f:
            f.write(response.body)

My laptop is out for repair, and on the spare Windows laptop I am not able to install Scrapy with Python 3, so I could not test this. But I am pretty sure this should do the job.
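
As a usage note: the spider can be run outside a full Scrapy project with CrawlerProcess; a minimal sketch, assuming the class above is saved in a module such as asamblea.py (a hypothetical module name).

from scrapy.crawler import CrawlerProcess

from asamblea import AsambleaSpider  # hypothetical module name

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(AsambleaSpider)
    process.start()  # blocks until the crawl is finished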

