How can I retrieve files with User-Agent headers in Python 3?

Question

I'm trying to write a (simple) piece of code to download files off the internet. The problem is, some of these files are on websites that block the default Python User-Agent header. For example:

import urllib.request as html
html.urlretrieve('http://stackoverflow.com', 'index.html')

returns

urllib.error.HTTPError: HTTP Error 403: Forbidden

Normally, I would set the headers in the request, such as:

import urllib.request as html
request = html.Request('http://stackoverflow.com', headers={"User-Agent":"Firefox"})
response = html.urlopen(request)

However, since urlretrieve only accepts a URL string (not a Request object), this isn't an option.

Are there any simple-ish solutions to this (that don't include importing a library such as requests)? I've noticed that urlretrieve is part of the legacy interface ported over from Python 2; is there anything I should be using instead?

I tried creating a custom FancyURLopener class to handle retrieving files, but that caused more problems than it solved, such as creating empty files for links that 404.

Answer

You can subclass URLopener and set the version class variable to a different user-agent, then continue using urlretrieve.
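
For example, here is a minimal sketch of that approach. The class name FirefoxOpener is just an illustration, and note that URLopener belongs to the same legacy interface, so it emits a DeprecationWarning on Python 3:

import urllib.request

# URLopener builds its User-Agent header from the `version` class
# attribute, so overriding it changes the header on every request
# made through this opener.
class FirefoxOpener(urllib.request.URLopener):
    version = 'Firefox'

opener = FirefoxOpener()
opener.retrieve('http://stackoverflow.com', 'index.html')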

Or you can simply use your second method and save the response to a file only after checking that code == 200.
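
A sketch of that second approach, reusing the Request from your example: urlopen raises HTTPError for statuses like 403 or 404, so no empty file is ever created for a bad link, and the status check is just an extra guard.

import shutil
import urllib.request

request = urllib.request.Request('http://stackoverflow.com',
                                 headers={'User-Agent': 'Firefox'})
# urlopen raises HTTPError on 4xx/5xx responses, so execution only
# reaches the write below for a successful request.
with urllib.request.urlopen(request) as response:
    if response.status == 200:
        with open('index.html', 'wb') as out_file:
            # Stream the body to disk without loading it all into memory.
            shutil.copyfileobj(response, out_file)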
