Extract academic publication information from IDEAS

Question

I want to extract the list of publications from a specific IDEAS page, retrieving the title, authors, and year of each paper, but I am a bit stuck. Inspecting the page shows that all the information is inside the div class="tab-pane fade show active" [...]. Within it, each h3 gives a year of publication, and each li class="list-group-item downfree" [...] contains a paper together with its authors (as shown in this image). In the end, I would like to obtain a dataframe with three columns: title, author, and year.

While I am able to retrieve each paper's title, I get confused when I also try to add the year and author(s). What I have written so far is the following short piece of code:

from requests import get
url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

containers = soup.findAll("div", {'class': 'tab-pane fade show active'})

title_list = []
year_list = []

for container in containers:

    year = container.findAll('h3')
    year_list.append(int(year[0].text))

    title_containers = container.findAll("li", {'class': 'list-group-item downfree'})
    title = title_containers[0].a.text
    title_list.append(title)  

What I get are two lists with only one element each, because the initial containers list has a length of 1. As for retrieving the authors' names, I have no idea how to do it; I tried several approaches without success. I think I have to split the titles using 'by' as a separator.

I hope someone can help me or point me to another discussion that deals with a similar situation. Thank you in advance, and apologies for my (probably) silly question; I am still a beginner at web scraping with BeautifulSoup.

Recommended answer

You can get the desired information like this:

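from requests import get
import pprint
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Work inside the element with id="content".
container = soup.select_one("#content")
title_list = []
author_list = []
# Each h3 heading holds one publication year.
year_list = [int(h.text) for h in container.find_all('h3')]
# Collect one list of titles and one list of authors per div.panel-body.
for panel in container.select("div.panel-body"):
    # Paper titles are the link texts.
    title_list.append([x.text for x in panel.find_all('a')])
    # The authors are the text node right after each <i> element.
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
# Pair each year with its titles and authors (assumes one panel per h3).
result = list(zip(year_list, title_list, author_list))

pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)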

Output:

[   (   2020,
        ['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
        ['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
    (   2019,
        [   'Probability Forecasts and Prediction Markets',
            'R&D Financing And Growth',
            'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
            'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
            'Consumption Smoothing Channels Within And Between Households',
            'A critical analysis of the secular stagnation theory',
            'Further evidence of the relationship between social transfers and income inequality in OECD countries',
            'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
        [   'Julia Mortera & A. Philip Dawid',
            'Luca Spinesi & Mario Tirelli',
            'Matteo Deleidi & Mariana Mazzucato',
            'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
            'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
            'Stefano Di Bucchianico',
            "Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
            'Giovanni Scarano']),
    (   2018, ...

I got the years using a list comprehension over the h3 elements. For each div element with the class panel-body, I appended one list to title_list and one to author_list, again using list comprehensions: the titles come from the a elements, and the authors come from the next_sibling of each i element. Then I zipped the three lists and converted the result to a list. Finally, I pretty-printed the result.
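The question asks for a table with three columns (title, author, year), whereas the result above is grouped by year. The following is a minimal sketch of one way to flatten it, not part of the original answer. The helper names flatten_results and to_dataframe and the sample variable are hypothetical, and the sample data is a small excerpt of the output shown earlier, standing in for the real result list. It assumes that each year's title and author lists line up one-to-one.

```python
from typing import Dict, List, Tuple, Union

# Shape produced by the answer's code: one (year, [titles], [authors]) tuple per year.
YearGroup = Tuple[int, List[str], List[str]]
Row = Dict[str, Union[str, int]]


def flatten_results(result: List[YearGroup]) -> List[Row]:
    """Turn per-year groups into one row per paper (hypothetical helper)."""
    rows: List[Row] = []
    for year, titles, authors in result:
        # zip pairs the n-th title with the n-th author string; it stops at
        # the shorter list, so mismatched lengths would silently drop entries.
        for title, author in zip(titles, authors):
            rows.append({"title": title, "author": author, "year": year})
    return rows


def to_dataframe(rows: List[Row]):
    """Build the three-column DataFrame (hypothetical helper; needs pandas)."""
    import pandas as pd  # imported lazily so flatten_results works without pandas
    return pd.DataFrame(rows, columns=["title", "author", "year"])


# Small excerpt of the printed output, used here instead of a live request.
sample = [
    (2020,
     ["The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms",
      "A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems"],
     ["Francesco Crespi & Serenella Caravella", "Cristina Vaquero-Piñeiro"]),
    (2019,
     ["Probability Forecasts and Prediction Markets"],
     ["Julia Mortera & A. Philip Dawid"]),
]

rows = flatten_results(sample)
for row in rows:
    print(row["year"], "|", row["author"], "|", row["title"])
```

With pandas installed, to_dataframe(flatten_results(result)) should give one row per paper. Each author entry is still a single string joined by " & ", so splitting it into individual names would be a further step.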
