Download PDFs with Python

This article describes how to download PDFs with Python and may be a useful reference for anyone facing the same problem.

Problem description

I am trying to download several PDFs that sit behind different hyperlinks on a single URL. My approach was first to retrieve the URLs containing the "fileEntryId" text, which point to the PDFs, following this link, and then to download the PDF files using the approach from this link.

This is "my" code so far:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import re
import os
import requests
from urllib.parse import urljoin


http = httplib2.Http()
status, response = http.request('https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a', href=re.compile('.*fileEntryId.*'))):
    if link.has_attr('href'):
        x=link['href']
        
        #If there is no such folder, the script will create one automatically
        folder_location = r'c:\webscraping'
        if not os.path.exists(folder_location):os.mkdir(folder_location)

        response = requests.get(x)
        soup = BeautifulSoup(response.text, "html.parser")
        # Bug: select("x") looks for literal <x> tags, so this loop never
        # matches anything; a real selector for the inner PDF links is needed
        for link in soup.select("x"):
            # Name the pdf files using the last portion of each link,
            # which is unique in this case
            filename = os.path.join(folder_location, link['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                # Bug in the original: `url` was undefined here; the base
                # for urljoin should be the inner page's URL, `x`
                f.write(requests.get(urljoin(x, link['href'])).content)

Thanks.

Answer

Create a folder anywhere and put the script in that folder. When you run the script, you should get the downloaded PDF files within that folder. If for some reason the script doesn't work for you, make sure your bs4 version is up to date, as I've used pseudo CSS selectors to target the required links.
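The selector chain used below can be illustrated on a minimal, hypothetical snippet (the real page's markup is more complex; the hrefs here are made up for the demonstration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the audit-listing table on the page
html = """
<table class="table">
  <tbody class="table-data">
    <tr><td class="first"><a href="/doc?fileEntryId=111">Informe 1</a></td></tr>
    <tr><td class="first"><a href="/doc?fileEntryId=222">Informe 2</a></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Same chain as the answer: a <table class="table"> with a direct
# <tbody class="table-data"> child, then any <td class="first"> whose
# direct <a> child has "fileEntryId" somewhere in its href attribute
links = [a["href"] for a in soup.select(
    "table.table > tbody.table-data td.first > a[href*='fileEntryId']")]
print(links)
```

The `[href*='...']` part is a standard CSS attribute-substring selector, so only the listing links survive while navigation anchors on the page are ignored.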

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Grab every listing link whose href contains "fileEntryId"
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text, "lxml")
        # The "Descargar" (Download) anchor on the inner page holds the PDF URL
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
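The file name above is derived by taking the last `/` segment of the PDF URL and dropping any query string. A standard-library alternative using `urllib.parse` handles percent-encoded names as well; the URL below is a made-up example, not one taken from the actual site:

```python
from urllib.parse import urlsplit, unquote

def pdf_filename(pdf_link: str) -> str:
    # Equivalent to pdf_link.split("/")[-1].split("?")[0], but parses the
    # URL properly and decodes percent-escapes in the final path segment
    path = urlsplit(pdf_link).path
    return unquote(path.rsplit("/", 1)[-1])

# Hypothetical download URL with a query string:
print(pdf_filename("https://example.com/documents/123/informe_2015.pdf?version=1.0"))
# → informe_2015.pdf
```

`urlsplit` separates the path from the query and fragment, so a `?version=...` suffix never leaks into the file name.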

