Get authors' names and URLs for a tag from Google Scholar

Question

I want to write a CSV file listing every author who has tagged themselves with a specific label on Google Scholar, together with the URL of each author's profile. For example, for the label 'security' I would want this output:

author          url
Howon Kim       https://scholar.google.pl/citations?user=YUoJP-oAAAAJ&hl=pl
Adrian Perrig   https://scholar.google.pl/citations?user=n-Oret4AAAAJ&hl=pl
...             ...

I have written this code, which prints each author's name:

# -*- coding: utf-8 -*-
import csv
import urllib.request

from bs4 import BeautifulSoup

url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')

# Each author's name sits in an <h3 class="gsc_1usr_name"> element.
name_headings = soup.find_all("h3", {"class": "gsc_1usr_name"})

# The CSV writer is set up here, but so far the names are only printed.
with open('sample.csv', 'w', newline='') as output_file:
    output_writer = csv.writer(output_file)
    for heading in name_headings:
        for anchor in heading.find_all('a'):
            print(anchor.text)

However, this only covers the first page of results. I would like to go through every page instead. How can I do this?

Recommended answer

I'm not going to write the code for you, but I'll give you an outline of how you can do it.

Look at the bottom of the page. See the next button? Search for it: the containing div has an id of gsc_authors_bottom_pag, which should be easy to find. I'd do this with Selenium: find the next button (the right-hand one) and click it, wait for the page to load, scrape, and repeat. Handle edge cases (running out of pages, etc.).
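The outline above could be sketched roughly as follows. This is only a minimal sketch, not a tested scraper: it assumes Selenium 4 with Firefox available, and it assumes the next button inside gsc_authors_bottom_pag carries the class gs_btnPR, which is not stated in this thread and should be checked against the live page. The function names write_authors_csv and scrape_label are made up for illustration.

```python
import csv


def write_authors_csv(rows, out_file):
    """Write (author, url) pairs to an open text file as CSV with a header row."""
    writer = csv.writer(out_file)
    writer.writerow(["author", "url"])
    writer.writerows(rows)


def scrape_label(label, max_pages=50):
    """Collect (name, profile URL) pairs for one label by clicking 'next'."""
    # Imported here so the CSV helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    rows = []
    driver = webdriver.Firefox()  # assumption: Firefox is installed
    try:
        driver.get(
            "https://scholar.google.pl/citations"
            "?view_op=search_authors&hl=pl&mauthors=label:" + label
        )
        for _ in range(max_pages):
            anchors = driver.find_elements(By.CSS_SELECTOR, "h3.gsc_1usr_name a")
            if not anchors:
                break  # no results, or a blocked/CAPTCHA page
            rows.extend((a.text, a.get_attribute("href")) for a in anchors)

            # Assumption: the "next" button has the class gs_btnPR.
            buttons = driver.find_elements(
                By.CSS_SELECTOR, "#gsc_authors_bottom_pag button.gs_btnPR"
            )
            if not buttons or not buttons[0].is_enabled():
                break  # out of pages
            buttons[0].click()
            try:
                # The old anchors go stale once the next page has replaced them.
                WebDriverWait(driver, 10).until(EC.staleness_of(anchors[0]))
            except TimeoutException:
                break
    finally:
        driver.quit()
    return rows


# Example usage (opens a real browser and queries Google Scholar):
# with open("sample.csv", "w", newline="", encoding="utf-8") as f:
#     write_authors_csv(scrape_label("security"), f)
```

The max_pages cap is an arbitrary safety limit so the loop cannot run forever if the end-of-results check never triggers.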

If the after_author=* bit of the URL didn't change from page to page, you could just increment the start offset in the URL. But unless you want to try to crack how that value is generated (unlikely), just click the next button.
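To see why incrementing the offset alone is not enough, it can help to pull the paging parameters out of a next-page URL. The snippet below is a small illustration only: the URL is hypothetical, PLACEHOLDER_TOKEN stands in for a real after_author value, and the offset parameter name astart is an assumption that this thread does not confirm.

```python
from urllib.parse import parse_qs, urlsplit


def pagination_params(url):
    """Return the paging-related query parameters of a results-page URL."""
    query = parse_qs(urlsplit(url).query)
    return {
        # Token that changes from page to page, per the answer above.
        "after_author": query.get("after_author", [None])[0],
        # Assumption: the numeric offset is carried in a parameter named astart.
        "astart": query.get("astart", [None])[0],
    }


# Hypothetical second-page URL; PLACEHOLDER_TOKEN is not a real token.
example = (
    "https://scholar.google.pl/citations?view_op=search_authors&hl=pl"
    "&mauthors=label:security&after_author=PLACEHOLDER_TOKEN&astart=10"
)
print(pagination_params(example))
```

Both values would have to be right for a hand-built URL to work, and only the offset is predictable, which is why clicking the button is the simpler route.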
