urllib2 returns 404 for a website which displays fine in browsers

Question

I am not able to open one particular URL using urllib2. The same approach works well with other websites such as "http://www.google.com", but not with this site (which displays fine in the browser).

My simple code:

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.experts.scival.com/einstein/"
response=urllib2.urlopen(url)
html=response.read()
soup=BeautifulSoup(html)
print soup

Can anyone help me make it work?

This is the error I got:

Traceback (most recent call last):
  File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
    response=urllib2.urlopen(url);
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found

Thank you

Solution

I just tried this and received a 404 code and a page back.

At a guess, it's doing User-Agent detection which, either by accident or on purpose, doesn't serve content to Python's urllib.

Clarification: with urllib, urlopen returned a response object with a 404 code and HTML content. With urllib2.urlopen, a urllib2.HTTPError exception was raised instead.
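
To see what the server actually sent back with the 404, the exception itself can be inspected, since HTTPError also behaves like a response object. The following is a minimal sketch, not the asker's code: the helper name fetch_status_and_body and its opener parameter are hypothetical, and the import fallback is only there so the same snippet runs on Python 3 as well as on the Python 2.7 shown in the traceback.

```python
try:
    from urllib2 import HTTPError, urlopen      # Python 2, as in the question
except ImportError:
    from urllib.error import HTTPError          # Python 3 equivalent
    from urllib.request import urlopen


def fetch_status_and_body(url, opener=urlopen):
    """Return (status_code, body), even when the server answers with an error status.

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network.
    """
    try:
        response = opener(url)
    except HTTPError as err:
        # HTTPError carries the status code and the body of the error page.
        return err.code, err.read()
    try:
        return response.getcode(), response.read()
    finally:
        response.close()
```

Calling fetch_status_and_body(url) with the URL from the question should then return the status code together with the HTML of the error page, rather than stopping at the traceback.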

I'd suggest you try setting your User-Agent to something that looks like a browser. There's a question about this here: Changing user agent on urllib2.urlopen
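
As a rough sketch of that suggestion, a Request object can carry a custom User-Agent header. The header value, the constant BROWSER_USER_AGENT, and the helper names build_request and fetch are illustrative assumptions, not something taken from the question. Whether this is enough to make this particular site stop returning 404 is untested here; it only follows the User-Agent guess above.

```python
try:
    import urllib2 as urlrequest           # Python 2, as in the question
except ImportError:
    import urllib.request as urlrequest    # Python 3 equivalent

# Illustrative browser-like value (an assumption, not from the question).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a Request that identifies itself with a browser-like User-Agent."""
    return urlrequest.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Open the URL with the custom header and return the response body."""
    response = urlrequest.urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()
```

In the original script, html = fetch(url) would take the place of the urlopen and read calls, and the result can be passed to BeautifulSoup as before.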
