Python download multiple files from links on pages

Problem description

I am trying to download a number of PGN files from this site. I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN by accessing it via the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I'm also unsure of how urlretrieve works.

from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', href=True):  # anchors without an href would crash the concatenation below
    # this opens every linked page, but never saves anything to disk
    urlopen('http://chessgames.com' + link.get('href'))
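
Since the question mentions being unsure how urlretrieve works: it simply fetches a URL and writes the response body to a local file, returning the local filename and the response headers. A minimal sketch (the URL and filename here are placeholders, not real links from the site):

from urllib.request import urlretrieve

# download the resource at the URL into the named local file;
# returns a (local_filename, http_headers) tuple
urlretrieve('http://example.com/some_game.pgn', 'some_game.pgn')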

Recommended answer

There is no short answer to your question. I will show you a complete solution and comment on this code.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly advise using the lxml parser rather than the common html.parser; lxml is a third-party, C-based parser (pip install lxml) and is considerably faster. After that, you should prepare the list of links to the games:

pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))

This works by matching links whose href contains the word 'chessgame'. Now, prepare the function that will download the files for you:

def download_file(url):
    # derive a local filename from the last path segment, dropping the query string
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)  # stream to avoid loading the whole file into memory
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:  # iterates over the response body in small chunks
                f.write(chunk)
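
A note on this function: the filename logic turns a download URL shaped like .../someFile.pgn?gid=123 into the local name someFile.pgn (that URL shape is illustrative, not taken from the site). Also, iterating directly over the response yields very small chunks; requests' iter_content does the same job with explicit control over the chunk size. An equivalent variant, where the 8192-byte chunk size is an arbitrary choice:

def download_file(url):
    # same filename logic: last path segment, query string stripped
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):  # write 8 KiB at a time
                f.write(chunk)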

Finally, repeat the previous steps for each game page, preparing the links for the file downloader:

host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    # find the link whose visible text contains the word 'download'
    file_link = soup.find('a', text=re.compile('.*download.*'))
    if file_link:  # skip pages that have no download link
        file_url = host + file_link.get('href')
        download_file(file_url)

(First you search for the link whose text contains the word 'download', then construct the full URL by concatenating the hostname and the path, and finally download the file.)
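
One optional refinement, not part of the original answer: the loop above opens a new HTTP connection for every game page. A requests.Session reuses one connection across the page fetches, which is noticeably faster over many games. A sketch reusing pages, host, and download_file from above (the downloads themselves still go through requests.get inside download_file):

with requests.Session() as session:  # one connection reused for all page fetches
    for page in pages:
        url = host + page.get('href')
        soup = BeautifulSoup(session.get(url).text, "lxml")
        file_link = soup.find('a', text=re.compile('.*download.*'))
        if file_link:
            download_file(host + file_link.get('href'))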

I hope you can use this code without correction!
