Extract Google Scholar results using Python (or R)


Question

I'd like to use Python to scrape Google Scholar search results. I found two different scripts that do this: gscholar.py and scholar.py (can the latter be used as a Python library?).

I should probably say up front that I'm totally new to Python, so sorry if I'm missing the obvious!

The problem is that when I use gscholar.py as explained in the README file, I get as a result

query() takes at least 2 arguments (1 given).

Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get

query() takes at least 2 arguments (2 given).

This puzzles me. I also tried specifying the third possible argument (outformat=4, which is the BibTeX format), but this gives me a list of function errors. A colleague advised me to import BeautifulSoup before running the query, but that doesn't change the problem either. Any suggestions on how to solve this?
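For what it's worth, this kind of error usually means query() has two required positional parameters, and a keyword argument like allresults=True does not fill the second slot. A minimal sketch reproducing the behavior (the signature below is an assumption for illustration, not gscholar's actual code):

```python
# Hypothetical signature: two required positional parameters, one optional.
def query(searchstr, outformat, allresults=False):
    return (searchstr, outformat, allresults)

try:
    # Mirrors gscholar.query("my query", allresults=True): the keyword
    # argument does not satisfy the required second positional parameter.
    query("my query", allresults=True)
    failed = False
except TypeError:
    # Python 2 worded this "query() takes at least 2 arguments (2 given)";
    # Python 3 says "missing 1 required positional argument".
    failed = True

# Supplying the second positional argument (4 = BibTeX, per the question)
# makes the call succeed:
result = query("my query", 4, allresults=True)
```

If that is the cause, passing the output format positionally should make the original error go away.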

I found code for R (see link) as a workaround, but it quickly got blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!

Answer

I suggest that instead of using libraries written for crawling one specific website, you use a general-purpose HTML library that is well tested and well documented, such as BeautifulSoup.

To access websites while presenting a browser's identity, you can use a URL opener with a custom User-Agent:

# Python 3: FancyURLopener (used in the original Python 2 answer) is
# deprecated; a Request with a User-Agent header is the usual replacement.
from urllib.request import Request, urlopen

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

def openurl(url):
    return urlopen(Request(url, headers={'User-Agent': USER_AGENT}))

Then download the required URL as follows:

openurl(url).read()

To retrieve Scholar results, just use a URL of the form http://scholar.google.se/scholar?hl=en&q=${query}.
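Filling the ${query} placeholder safely requires percent-encoding the search string; a small sketch (scholar_url is a helper name introduced here, not part of any library):

```python
from urllib.parse import quote_plus

BASE = 'http://scholar.google.se/scholar?hl=en&q='

def scholar_url(query):
    # quote_plus encodes spaces as '+' and escapes reserved characters,
    # which is what Scholar's q= parameter expects.
    return BASE + quote_plus(query)

url = scholar_url('deep learning')
# 'http://scholar.google.se/scholar?hl=en&q=deep+learning'
```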

To extract pieces of information from the retrieved HTML, you can use this piece of code:

from bs4 import SoupStrainer, BeautifulSoup
page = BeautifulSoup(openurl(url).read(), 'html.parser',
                     parse_only=SoupStrainer('div', id='gs_ab_md'))

This piece of code extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
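A self-contained sketch of that parsing step, run against a saved snippet of markup rather than the live site (repeated automated requests get blocked quickly; the gs_rt class for result titles is an assumption about Scholar's markup at the time and may have changed):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a downloaded Scholar results page.
html = '''
<html><body>
  <div id="gs_ab_md">About 1,230 results (0.03 sec)</div>
  <h3 class="gs_rt"><a href="#">Some paper title</a></h3>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# The result-count banner lives in the div with id="gs_ab_md".
count_text = soup.find('div', id='gs_ab_md').get_text(strip=True)

# Result titles (hypothetically) carry the gs_rt class.
titles = [h3.get_text() for h3 in soup.find_all('h3', class_='gs_rt')]
```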

