Why can urlopen download a Google search page but not a Google Scholar search page?

Problem description

I'm using Python 3.2.3's urllib.request module to download Google search results, but I'm getting an odd error in that urlopen works with links to Google search results, but not Google Scholar. In this example, I'm searching for "JOHN SMITH". This code successfully prints HTML:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google
try:
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649'''
    req_google = Request(page_google)
    req_google.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read()
    print(html_google[0:10])
except URLError as e:
    print(e)

but this code, doing the same for Google Scholar, raises a URLError exception:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google Scholar
try:
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'''
    req_scholar = Request(page_scholar)
    req_scholar.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)

Traceback:

Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>

I obtained these links by searching in Chrome and copying them from there. One commenter reported a 403 error, which I sometimes get as well. I presume this is because Google doesn't support scraping of Scholar. However, changing the User-Agent string doesn't fix this or the original problem, since I get URLErrors most of the time.

Recommended answer

This PHP script seems to indicate you'll need to set some cookies before Google gives you results:

/*

 Need a cookie file (scholar_cookie.txt) like this:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2
.google.com     TRUE    /       FALSE   1317124758      PREF    ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk   TRUE    /       FALSE   2147483647      GSP     ID=f3f18b3b5a7c2647:CF=2
.google.co.uk   TRUE    /       FALSE   1317125123      PREF    ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

*/
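
Translated into this question's Python 3 setup, here is a minimal sketch of the same idea using the standard library's cookie support (my illustration, not part of the original answer; it assumes a plain visit to the Scholar homepage is enough to pick up cookies like the GSP/PREF entries above):

from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

# Keep cookies in memory and replay them on every request made
# through this opener.
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) '
                      'Gecko/20120427 Firefox/15.0a1')]

# Hit the homepage first so Google can set its cookies in the jar ...
opener.open('http://scholar.google.com/').read()

# ... then run the search through the same opener, which sends them back.
page_scholar = ('http://scholar.google.com/scholar'
                '?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14')
html_scholar = opener.open(page_scholar).read()
print(html_scholar[0:10])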

The cookie requirement is corroborated by a comment on the Python recipe for Google Scholar, which warns that Google detects scripts and will disable your access if you use one too prolifically.
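
If you do script against Scholar anyway, spacing requests out is the obvious precaution. A hedged sketch (the interval is an arbitrary guess, not a documented limit, and the query list is hypothetical):

import time
from urllib.request import urlopen, Request
from urllib.error import URLError

# Hypothetical list of URL-encoded queries to run in one session.
queries = ['%22JOHN+SMITH%22', '%22JANE+DOE%22']
for q in queries:
    req = Request('http://scholar.google.com/scholar?hl=en&q=' + q)
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) '
                   'Gecko/20120427 Firefox/15.0a1')
    try:
        print(urlopen(req).read()[0:10])
    except URLError as e:
        print(e)
    time.sleep(30)  # pause between requests rather than hammering the site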
