Loop url from dataframe and download pdf files in Python


Question


Based on the code from here, I'm able to crawl the url for each transaction and save them into an excel file, which can be downloaded here.

Now I would like to go further and click the url link:

For each url, I will need to open and save pdf format files:

How could I do that in Python? Any help would be greatly appreciated.

Code for references:

import shutil
import os
from bs4 import BeautifulSoup
import requests

url = 'xxx'
os.makedirs('./files', exist_ok=True)  # make sure the output folder exists
for page in range(6):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.select("h3[class='sv-card-title'] > a"):
        r = requests.get(link.get("href"), stream=True)
        r.raw.decode_content = True
        with open('./files/' + link.text + '.pdf', 'wb') as f:
            shutil.copyfileobj(r.raw, f)

Solution

An example of downloading one of the pdf files listed in your uploaded excel file.

from bs4 import BeautifulSoup
import requests

# Let's assume there is only one page. If you need to download many files, collect the urls in a list first.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text

print(title.strip() + '.pdf')
data = requests.get(link.get("href")).content
with open(title.strip().replace(":", "-") + '.pdf', "wb+") as f:  # file names can't contain ':', so replace it with "-"
    f.write(data)
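Only ':' is handled above, but Windows forbids several other characters in file names. A small helper (hypothetical, not part of the original answer) sketches a more general cleanup:

```python
import re

def safe_filename(name):
    # Windows forbids \ / : * ? " < > | in file names; replace each with "-"
    return re.sub(r'[\\/:*?"<>|]', "-", name).strip()
```

For example, `safe_filename('a:b?.pdf')` returns `'a-b-.pdf'`.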

And it downloads successfully.
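The snippet above downloads a single notice. To cover the original goal of looping over every url stored in the crawled excel file, a sketch along these lines could work (assuming the dataframe has a column named `url` holding the notice links; adjust the column name to match your file):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def pdf_link_and_title(html):
    # Pull the pdf href and the notice title out of one notice page,
    # using the same selectors as the answer above.
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(".lookmore")
    title = soup.select_one(".newsContent h1").text.strip()
    return link.get("href"), title

def download_all(df, url_col="url"):
    # Visit every url in the dataframe column and save each pdf locally.
    for url in df[url_col].dropna():
        r = requests.get(url)
        href, title = pdf_link_and_title(r.content)
        data = requests.get(href).content
        with open(title.replace(":", "-") + ".pdf", "wb") as f:
            f.write(data)

# df = pd.read_excel("notices.xlsx")  # the excel file crawled in the question
# download_all(df)
```

Error handling around the two `requests.get` calls is omitted for brevity; in practice a `try`/`except` per url keeps one bad link from stopping the whole loop.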
