HTTPError when using urllib2 read()

Question

I'm trying to scrape a web page using urllib2 and BeautifulSoup. It was working fine, but after I added an input() call in a different part of my code to debug something, I got an HTTPError. When I ran my program again, I got an HTTPError when trying to call read(). The traceback is below:

[2013-07-17 16:47:07,415: ERROR/MainProcess] Task program.tasks.testTask[460db7cf-ff58-4a51-9c0f-749affc66abb] raised exception: IOError()
16:47:07 celeryd.1 | Traceback (most recent call last):
16:47:07 celeryd.1 |   File "/Users/username/folder/server2/venv/lib/python2.7/site-packages/celery/execute/trace.py", line 181, in trace_task
16:47:07 celeryd.1 |     R = retval = fun(*args, **kwargs)
16:47:07 celeryd.1 |   File "/Users/username/folder/server2/program/tasks.py", line 193, in run
16:47:07 celeryd.1 |     self.get_top_itunes_game_by_genre(genre)
16:47:07 celeryd.1 |   File "/Users/username/folder/server2/program/tasks.py", line 244, in get_top_itunes_game_by_genre
16:47:07 celeryd.1 |     game_page = BeautifulSoup(urllib2.urlopen(game_url).read())
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
16:47:07 celeryd.1 |     return _opener.open(url, data, timeout)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
16:47:07 celeryd.1 |     response = meth(req, response)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
16:47:07 celeryd.1 |     'http', request, response, code, msg, hdrs)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
16:47:07 celeryd.1 |     return self._call_chain(*args)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
16:47:07 celeryd.1 |     result = func(*args)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
16:47:07 celeryd.1 |     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
16:47:07 celeryd.1 | HTTPError

The code is below:

for game_url in urls:    
    game_page = BeautifulSoup(urllib2.urlopen(game_url).read())
    # code to process page

Does anyone know why I started getting this error? Thanks!

Recommended answer

Changing my comment into an answer:

The page that you're scraping responded with (most likely) a 4xx status, and urllib2 raises an HTTPError for such responses, as the docs say it does. It is your job to catch that exception and (hopefully) do something with it: log it, or what have you. For whatever reason, your traceback doesn't display the code/reason for the HTTPError, but they are there. Look at the 'code' and 'reason' attributes of the error.
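
A minimal sketch of catching the exception and inspecting those attributes. The helper name fetch_html and its opener parameter are assumptions made for illustration, not part of the original code; the import fallback is only there so the same snippet also runs on Python 3, where urllib2 was split into urllib.request and urllib.error. Note that in the traceback above the exception comes out of urlopen() rather than read(), so the try block wraps the whole call.

```python
try:
    # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError
except ImportError:
    # Python 3 equivalents of the same names
    from urllib.request import urlopen
    from urllib.error import HTTPError


def fetch_html(url, opener=urlopen):
    """Return the response body, or None if the server answered with an HTTP error.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be exercised without touching the network.
    """
    try:
        return opener(url).read()
    except HTTPError as err:
        # err.code is the numeric status; err.reason is the short message
        print("Skipping %s: HTTP %s (%s)" % (url, err.code, err.reason))
        return None
```

In the loop from the question, this could replace the direct urllib2.urlopen(game_url).read() call, skipping the iteration whenever fetch_html returns None.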

Editorial: it is possible that the website you were scraping figured out that you're a robot. You might want to take a moment to rewrite your scraper to use a more server-friendly library with a vastly better API. urllib2 is fine for one-off tasks, but it has numerous shortcomings that I won't get into here. Possibly superior libraries to look at are requests, mechanize, and maybe httplib2. All have upsides and downsides, so I can't tell you which one is right for your needs.

You may also want to look at what User-Agent header you're sending with your requests, since if you self-identify as a robot, well. Yeah.
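
As a rough illustration of controlling that header, the sketch below builds a Request with an explicit User-Agent. The helper name build_request and the user-agent string are placeholders made up for this example; whether changing the header has any effect on the error in the question is not something the thread confirms.

```python
try:
    # Python 2, as used in the question
    from urllib2 import Request
except ImportError:
    # Python 3 equivalent
    from urllib.request import Request

# Hypothetical value; substitute whatever identifies your client
DEFAULT_USER_AGENT = "my-scraper/0.1 (contact: me@example.com)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request carrying an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})
```

The resulting object can be passed to urlopen() in place of the bare URL string.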
