How to download and save all PDFs from a dynamic website?

Question

I am trying to download all the PDFs from some websites with dynamic elements and save them in a folder, e.g. https://www.bankinter.com/banca/nav/documentos-datos-fundamentales

Every PDF on this page has a similar href. Here is one of them:

"https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc=workspace://SpacesStore/fb029023-dd29-47d5-8927-31021d834757;1.0&nameDoc=ISIN_ES0213679FW7_41-Bonos_EstructuradosGarantizad_19.16_es.pdf"

Here is what I did for another website; this code works as desired:

import os
import requests

link = 'https://www.bankia.es/estaticos/documentosPRIIPS/json/jsonSimple.txt'
base = 'https://www.bankia.es/estaticos/documentosPRIIPS/{}'

# Target folder under the Windows user profile; created if missing
dirf = os.path.join(os.environ['USERPROFILE'], "Documents", "TFM", "PdfFolder")
os.makedirs(dirf, exist_ok=True)
os.chdir(dirf)

# The JSON file lists every document; each entry carries the PDF file name
res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
for item in res.json():
    if 'nombre_de_fichero' not in item:
        continue
    link = base.format(item['nombre_de_fichero'])
    filename_bankia = item['nombre_de_fichero'].split('.')[-2] + ".PDF"
    with open(filename_bankia, 'wb') as f:
        f.write(requests.get(link).content)

Recommended answer

You have to send an HTTP POST request with the appropriate JSON payload. Once you get the response, parse two fields, objectId and nombreFichero, and use them to build the correct links to the PDFs. The following should work:

import os
import requests

url = 'https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list'
base = 'https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}'
# JSON body sent to the listing endpoint
payload = {"cod_categoria": 2,"cod_familia": 3,"divisaDestino": None,"vencimiento": None,"edadActuarial": None}

# Target folder on the Windows desktop; created if missing
dirf = os.path.join(os.environ['USERPROFILE'], "Desktop", "PdfFolder")
os.makedirs(dirf, exist_ok=True)
os.chdir(dirf)

r = requests.post(url, json=payload)
for item in r.json():
    objectId = item['objectId']
    nombreFichero = item['nombreFichero'].replace(" ", "_")
    # Drop only the final extension; the names contain other dots (e.g. "..._19.16_es.pdf")
    filename = os.path.splitext(nombreFichero)[0] + ".PDF"
    link = base.format(objectId, nombreFichero)
    with open(filename, 'wb') as f:
        f.write(requests.get(link).content)

After running the script, give it some time to finish, because the site is very slow.
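Because the site is slow, it may help to give each download an explicit timeout and to stream the file to disk instead of holding it in memory. The following is only a sketch under assumptions: the helper names `build_link`, `local_name` and `download_pdf` are made up for illustration, the 120-second timeout is an arbitrary guess, and `session` is assumed to be a `requests.Session` created by the caller.

```python
import os

BASE = "https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}"


def build_link(object_id, file_name):
    """Fill objectId and nombreFichero into the download URL (spaces -> underscores)."""
    return BASE.format(object_id, file_name.replace(" ", "_"))


def local_name(file_name):
    """Local file name: spaces become underscores, the last extension becomes .PDF."""
    stem, _ext = os.path.splitext(file_name.replace(" ", "_"))
    return stem + ".PDF"


def download_pdf(session, link, target, timeout=120):
    """Stream one PDF to `target`; return True on success, False on failure.

    `session` is assumed to be a requests.Session. The exceptions raised by
    requests derive from OSError, so a single except clause covers network
    errors and file errors alike.
    """
    try:
        with session.get(link, stream=True, timeout=timeout) as resp:
            resp.raise_for_status()  # treat HTTP 4xx/5xx as a failure
            with open(target, "wb") as f:
                for chunk in resp.iter_content(chunk_size=65536):
                    f.write(chunk)
        return True
    except OSError:
        return False
```

With `session = requests.Session()`, each item returned by the POST request could then be fetched with `download_pdf(session, build_link(item['objectId'], item['nombreFichero']), local_name(item['nombreFichero']))`, collecting the items for which it returns False so they can be retried later. A download that fails midway may leave a partial file behind.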
