Scraping a large number of Google Scholar pages by URL
Question
I'm trying to get the full author list of all publications from an author on Google Scholar using BeautifulSoup. Since the author's home page only shows a truncated author list for each paper, I have to open each paper's link to get the full list. As a result, I run into a CAPTCHA every few attempts.
Is there a way to avoid the CAPTCHA (e.g. pause for 3 seconds after every request)? Or a way to make the original Google Scholar profile page show the full author list?
Answer
I recently faced a similar issue. I at least eased my collection process with a simple workaround: a random and fairly long sleep between requests, like this:
import time
import numpy as np

time.sleep((30 - 5) * np.random.random() + 5)  # sleep between 5 and 30 seconds
If you have enough time (say, you launch your parser at night), you can make the pauses even bigger (3+ times longer) to make sure you won't get a CAPTCHA.
Furthermore, you can randomly change the user-agent in your requests to the site, which will mask you even more.