Python download multiple files from links on pages
Question
I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN by accessing it via the download button near the bottom of each game page. Do I have to create a new BeautifulSoup object for each game? I'm also unsure of how urlretrieve works.
import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup
url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com' + link.get('href'))
Answer
There is no short answer to your question. I will show you a complete solution and comment on this code.
First, import the necessary modules:
from bs4 import BeautifulSoup
import requests
import re
Next, get the index page and create a BeautifulSoup object:
req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")
I strongly advise using the lxml parser rather than the common html.parser.

After that, you should prepare a list of links to the games:
pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))
You can do it by searching for links whose href contains the word 'chessgame'. Now you should prepare a function that will download the files for you:
def download_file(url):
    # local filename: last URL segment, with any query string stripped
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
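A note on the first line of that function: `url.split('/')[-1].split('?')[0]` takes the last segment of the URL and drops any query string, so the saved file gets a clean name. Factored out as a standalone helper (the helper name and sample URL below are illustrative, not taken from the site):

```python
def filename_from_url(url):
    # last path segment, with any '?query' suffix removed
    return url.split('/')[-1].split('?')[0]

# e.g. a hypothetical download link:
filename_from_url('http://www.chessgames.com/pgn/game.pgn?gid=1014492')
```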
And the final step is to repeat all the previous steps, preparing the links for the file downloader:
host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    file_link = soup.find('a', text=re.compile('.*download.*'))
    file_url = host + file_link.get('href')
    download_file(file_url)
(First you search for the link whose text contains the word 'download', then construct the full URL by concatenating the hostname and the path, and finally download the file.)
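One design note: the plain `host + href` concatenation works here because the site's hrefs are root-relative. If the hrefs could be absolute or relative, `urllib.parse.urljoin` from the standard library is a safer way to build the full URL — a small sketch:

```python
from urllib.parse import urljoin

host = 'http://www.chessgames.com'

# a root-relative href is joined onto the host
urljoin(host, '/perl/chessgame?gid=1014492')

# an already-absolute href is returned unchanged
urljoin(host, 'http://example.com/other')
```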
I hope you can use this code without modification!