Web scraping using Python

Problem Description

I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user-agent issue, but changing it did not help. I then thought it might have something to do with cookies, but apparently loading the page in the links text-mode browser with cookies turned off works fine. What might be blocking requests made through urllib2?
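
For context, a minimal reproduction of the failure (assuming stock urllib2 with no extra headers set) looks like this:

import urllib2

try:
    # A bare urlopen sends no Accept header, which this server rejects.
    urllib2.urlopen('http://www.nseindia.com/')
except urllib2.HTTPError as e:
    print e.code  # prints 403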

Recommended Answer

http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:

import urllib2

# Build the request and attach the headers the server insists on.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')

# Send the request and read the response body.
opener = urllib2.build_opener()
content = opener.open(r).read()
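
Note that urllib2 is Python 2 only; on Python 3 the equivalent functionality lives in urllib.request. A sketch of the same fix under that assumption (the User-Agent string is just the example value from above):

import urllib.request

req = urllib.request.Request('http://www.nseindia.com/')
# Same headers as the Python 2 version; the Accept header is the one
# this particular server appears to require.
req.add_header('Accept', '*/*')
req.add_header('User-Agent', 'My scraping program <author@example.com>')
with urllib.request.urlopen(req) as resp:
    content = resp.read()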

Refusing requests without Accept headers is incorrect; RFC 2616 clearly states:

If no Accept header field is present, then it is assumed that the client accepts all media types.
