urllib2 returns 404 for a website which displays fine in browsers
Problem description
I am not able to open one particular URL using urllib2. The same approach works well for other websites such as http://www.google.com, but not for this site (which displays fine in a browser).
My code:
from BeautifulSoup import BeautifulSoup
import urllib2

url = "http://www.experts.scival.com/einstein/"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
print soup
Can anyone help me make it work?
This is the error I got:
Traceback (most recent call last):
File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
response=urllib2.urlopen(url);
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
result = self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
Thank you
Solution
I just tried this and got back a 404 status code along with a page.
At a guess, the site is doing User-Agent detection which, either by accident or on purpose, doesn't serve content to Python's urllib.
To clarify: with urllib, urlopen returned a response object with a 404 code and HTML content. With urllib2.urlopen, a urllib2.HTTPError exception was raised instead.
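To see what the server actually sends back with the 404, the exception can be caught and read like a normal response. The following is a minimal sketch, not a definitive fix: fetch_status_and_body and its opener parameter are hypothetical names introduced here, and the import fallback is only there so the same code also runs on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
try:  # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError
except ImportError:  # Python 3 equivalents of the same names
    from urllib.request import urlopen
    from urllib.error import HTTPError


def fetch_status_and_body(url, opener=urlopen):
    """Return (status_code, body), even when the server answers with an error.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be exercised without touching the network.
    """
    try:
        response = opener(url)
        return response.getcode(), response.read()
    except HTTPError as err:
        # HTTPError doubles as a file-like response: it carries the status
        # code and whatever body the server sent along with the error.
        return err.code, err.read()
```

Calling fetch_status_and_body("http://www.experts.scival.com/einstein/") would then return the 404 code together with the HTML of the error page rather than raising, which makes it easier to check what the site is serving to the script.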
I'd suggest setting your User-Agent to something that looks like a browser. There's a question about this here: Changing user agent on urllib2.urlopen
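A minimal sketch of that suggestion follows. The User-Agent string, the BROWSER_USER_AGENT constant and the helper names build_browser_request and fetch_html are all illustrative assumptions, not values the site is known to require, and since User-Agent detection is only a guess there is no guarantee this makes the 404 go away. The import fallback again only exists so the code runs on Python 3 as well.

```python
try:  # Python 2, as used in the question
    from urllib2 import Request, urlopen
except ImportError:  # Python 3 equivalents of the same names
    from urllib.request import Request, urlopen

# Assumption: any browser-like string may do; this exact value is illustrative.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a Request whose User-Agent header replaces the Python default."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Fetch `url` with the browser-like User-Agent and return the raw body."""
    response = urlopen(build_browser_request(url))
    return response.read()
```

In the question's code, html = fetch_html(url) would replace the urllib2.urlopen(url) and response.read() lines; the rest of the BeautifulSoup handling stays the same.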