Loop url from dataframe and download pdf files in Python
Problem Description
Based on the code from here, I'm able to crawl the url for each transaction and save the results into an excel file, which can be downloaded here.
Now I would like to go further and follow each url link:
For each url, I need to open and save the pdf file:
How could I do that in Python? Any help would be greatly appreciated.
Code for reference:
import shutil
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'xxx'
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title']>a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)
Solution
An example of downloading one pdf file listed in your uploaded excel file.
from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files, save the URLs in a list.
url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text
print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb+") as f:  # file names shouldn't contain ':', so replace it with '-'
    f.write(data)
And it downloads successfully:
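The snippet above handles a single page. To loop over a whole dataframe of URLs, as the question asks, the same pattern can be wrapped in a loop. Below is a minimal sketch: the column name `url`, the output directory `./files`, and the `safe_filename` helper are assumptions for illustration, not part of the original answer; the CSS selectors are the ones from the answer above.

```python
import os
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def safe_filename(title):
    """Replace characters that are not allowed in file names with '-'."""
    return re.sub(r'[\\/:*?"<>|]', '-', title.strip())


def download_pdfs(excel_path, out_dir='./files'):
    """Read URLs from an excel file and save each linked pdf to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    df = pd.read_excel(excel_path)        # assumes a column named 'url'
    for url in df['url']:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        link = soup.select_one(".lookmore")
        if link is None:                  # some pages may have no pdf link
            continue
        title = soup.select_one(".newsContent").select_one("h1").text
        data = requests.get(link.get("href")).content
        with open(os.path.join(out_dir, safe_filename(title) + '.pdf'), 'wb') as f:
            f.write(data)
```

Sanitizing the title once in a helper keeps the open() call readable and covers all the characters Windows forbids in file names, not just ':'.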