Why can urlopen download a Google search page but not a Google Scholar search page?
Question
I'm using Python 3.2.3's urllib.request module to download Google search results, but I'm getting an odd error in that urlopen works with links to Google search results, but not Google Scholar. In this example, I'm searching for "JOHN SMITH". This code successfully prints HTML:
from urllib.request import urlopen, Request
from urllib.error import URLError

# Google
try:
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649'''
    req_google = Request(page_google)
    req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read()
    print(html_google[0:10])
except URLError as e:
    print(e)
but this code, doing the same for Google Scholar, raises a URLError exception:
from urllib.request import urlopen, Request
from urllib.error import URLError

# Google Scholar
try:
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'''
    req_scholar = Request(page_scholar)
    req_scholar.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)
with this traceback:
Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>
I obtained these links by searching in Chrome and copying the link from there. One commenter reported a 403 error, which I sometimes get as well. I presume this is because Google doesn't support scraping of Scholar. However, changing the User Agent string doesn't fix this or the original problem, since I get URLErrors most of the time.
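(As an aside: the standard header name is 'User-Agent', with a hyphen; add_header('User Agent', ...) sends a differently named, non-standard header instead. A minimal sketch of setting it the hyphenated way, reusing the browser string from above and an abbreviated form of the Scholar URL:)

from urllib.request import urlopen, Request
from urllib.error import URLError

# Same browser string as in the question; only the header name differs.
ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1'

# Abbreviated form of the Scholar URL from the question.
page_scholar = 'http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22'

req_scholar = Request(page_scholar)
req_scholar.add_header('User-Agent', ua)   # hyphenated header name

try:
    print(urlopen(req_scholar).read()[0:10])
except URLError as e:
    print(e)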
Answer
This PHP script seems to indicate you'll need to set some cookies before Google gives you results:
/*
Need a cookie file (scholar_cookie.txt) like this:
# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2
.google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk TRUE / FALSE 2147483647 GSP ID=f3f18b3b5a7c2647:CF=2
.google.co.uk TRUE / FALSE 1317125123 PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN
*/
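In Python, one way to reuse a Netscape-format cookie file like the one above is http.cookiejar.MozillaCookieJar together with a cookie-aware opener. The following is only a sketch of that idea, not code from the original answer; it assumes the cookies have been saved as scholar_cookie.txt (the file name used in the PHP comment) and reuses the URL and browser string from the question:

import http.cookiejar
from urllib.request import Request, HTTPCookieProcessor, build_opener
from urllib.error import URLError

# Load the Netscape-format cookie file shown above.
cookie_jar = http.cookiejar.MozillaCookieJar('scholar_cookie.txt')
cookie_jar.load(ignore_discard=True, ignore_expires=True)

# Build an opener that sends those cookies with every request.
opener = build_opener(HTTPCookieProcessor(cookie_jar))

page_scholar = 'http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'
req_scholar = Request(page_scholar)
req_scholar.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')

try:
    html_scholar = opener.open(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)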
This is corroborated by a comment on a Python recipe for Google Scholar, which includes a warning that Google detects scripts and will disable you if you use them too prolifically.
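If you do query Scholar from a script despite that warning, spacing requests out is the usual precaution. A minimal, hypothetical sketch; the query list and delay range are arbitrary examples, not values from the recipe:

import random
import time
from urllib.request import urlopen, Request
from urllib.error import URLError

# Arbitrary example queries (URL-encoded), not taken from the recipe.
queries = ['%22JOHN+SMITH%22', '%22JANE+DOE%22']

for q in queries:
    url = 'http://scholar.google.com/scholar?hl=en&q=' + q
    req = Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    try:
        print(url, urlopen(req).read()[0:10])
    except URLError as e:
        print(url, e)
    time.sleep(random.uniform(10, 30))  # arbitrary pause between requests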