Loop url from dataframe and download pdf files in Python


Question


Based on the code from here, I'm able to crawl the url for each transaction and save them into an excel file which can be downloaded here.

Now I would like to go further and click the url link:

For each url, I will need to open and save the pdf file:

How could I do that in Python? Any help would be greatly appreciated.

Code for references:

import os
import shutil
from bs4 import BeautifulSoup
import requests

url = 'xxx'  # template url with a page placeholder, e.g. ending in '?page={}'
os.makedirs('./files', exist_ok=True)  # make sure the output directory exists
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)

Solution

An example of downloading a pdf file from one of the urls in your uploaded excel file.

from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files, save the urls in a list.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")  # anchor that points at the pdf
title = soup.select_one(".newsContent").select_one("h1").text

print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
# file names shouldn't contain ':', so replace it with '-'
with open(title.strip().replace(":", "-") + '.pdf', "wb+") as f:
    f.write(data)

And it downloads successfully:
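To extend this to every transaction url in the uploaded excel file, the same scraping steps can run inside a loop over a dataframe column. A minimal sketch with pandas; the file name `transactions.xlsx`, the column name `url`, and the helper names are assumptions for illustration, not part of the original answer:

```python
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup


def safe_name(title):
    """Turn a page title into a valid pdf file name (':' is not allowed on Windows)."""
    return title.strip().replace(":", "-") + ".pdf"


def download_pdfs(excel_path="transactions.xlsx", url_column="url", out_dir="files"):
    """Visit each url in the dataframe column and save the linked pdf to out_dir."""
    os.makedirs(out_dir, exist_ok=True)  # create the target directory if needed
    df = pd.read_excel(excel_path)
    for url in df[url_column].dropna():
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        link = soup.select_one(".lookmore")
        if link is None:  # some notice pages may have no pdf attachment
            continue
        title = soup.select_one(".newsContent").select_one("h1").text
        data = requests.get(link.get("href")).content
        with open(os.path.join(out_dir, safe_name(title)), "wb") as f:
            f.write(data)
```

Wrapping the logic in a function keeps the dataframe loop separate from the per-page scraping, so a single failing url can be skipped without aborting the whole run.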
