Loop url from dataframe and download pdf files in Python
Problem description
Based on the code from here, I'm able to crawl the url for each transaction and save them into an excel file, which can be downloaded here. Now I would like to go further and click the url link:
For each url, I will need to open and save pdf format files:
How could I do that in Python? Any help would be greatly appreciated.
Code for reference:
import shutil
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'xxx'  # listing-page URL template; must contain '{}' so url.format(page) can fill in the page number
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True  # decode gzip/deflate before copying the raw stream
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)
Solution
An example of downloading a pdf file from your uploaded excel file.
from bs4 import BeautifulSoup
import requests
# Let's assume there is only one page. If you need to download many files, save them in a list.
url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text
print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb+") as f:  # file names shouldn't contain ':', so replace it with '-'
f.write(data)
And it downloads successfully:
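The snippet above fetches a single notice page. To match the question's setup of looping urls from a dataframe, here is a minimal sketch that reads the crawled excel file with pandas and repeats the same steps per row. The file name 'urls.xlsx' and the column name 'url' are assumptions; adjust them to your actual file.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed: the crawled excel file is 'urls.xlsx' and the notice urls sit in a column named 'url'.
df = pd.read_excel('urls.xlsx')
for url in df['url']:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    link = soup.select_one(".lookmore")  # same selector as in the answer above
    if link is None:                     # skip pages without a pdf attachment
        continue
    title = soup.select_one(".newsContent").select_one("h1").text
    data = requests.get(link.get("href")).content
    with open(title.strip().replace(":", "-") + '.pdf', "wb") as f:
        f.write(data)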