Loop to scrape data from Wikipedia in R


Question

I am trying to extract data about celebrity/notable deaths for analysis. Wikipedia has a very regular structure to their html paths concerning notable dates of death. It looks like:

https://en.wikipedia.org/wiki/Deaths_in_"MONTH"_"YEAR"

For example, this link leads to the notable deaths in March, 2014.

https://en.wikipedia.org/wiki/Deaths_in_March_2014

I have located the CSS location of the lists I need to be "#mw-content-text h3+ ul li" and extracted it for a specific link successfully. Now I'm trying to write a loop to go through the months and any years that I choose. I think it's a pretty straightforward nested loop, but I'm getting errors when testing it on just 2015.

library(rvest)

data = data.frame()
mlist = c("January","February","March","April","May","June","July","August",
          "September","October","November","December")

for (y in 2015:2015){
  for (m in 1:12){
    site = read_html(paste("https://en.wikipedia.org/wiki/Deaths_in_",
                           mlist[m], "_", y, collapse=""))
    fnames = html_nodes(site, "#mw-content-text h3+ ul li")
    text = html_text(fnames)
    data = rbind(data, text, stringsAsFactors=FALSE)
  }
}

When I comment out the line:

data = rbind(data,text,stringsAsFactors=FALSE)

no errors are returned so it's clearly related to this bit. I am posting my whole code for other comments as well. The goal here is to loop through many years and then focus on the distribution over the years and months. For this I just need to keep the age, month, and year of death.

Thanks!

Sorry, they are technically warnings, not errors. I get over 50 of them and when I try to look at "data" it is a giant mess.

When I run this code not as a loop on one specific URL, it works fine and returns a readable output.

site = read_html("https://en.wikipedia.org/wiki/Deaths_in_January_2015")
fnames = html_nodes(site,"#mw-content-text h3+ ul li")
text = html_text(fnames)

Here are a couple of rows from that data set:

text[1:5]
[1] "Barbara Atkinson, 88, British actress (Z-Cars).[1]"                                         
[2] "Staryl C. Austin, 94, American air force brigadier general.[2]"                             
[3] "Ulrich Beck, 70, German sociologist, heart attack.[3]"                                      
[4] "Fiona Cumming, 77, British television director (Doctor Who).[4]"                            
[5] "Eric Cunningham, 65, Canadian politician, Ontario MPP for Wentworth North (1975â€"1984).[5]"

Answer

html_text(fnames) returns a character vector. Your problem is trying to append that vector directly onto a data frame.
Try converting your variable text to a data frame before appending:

for (y in 2015:2015){
  for (m in 1:12){
    # paste0() builds the URL without separators; paste(..., collapse="")
    # still inserts sep=" " spaces between its arguments (MediaWiki happens
    # to normalize those away, but the clean URL is safer)
    site = read_html(paste0("https://en.wikipedia.org/wiki/Deaths_in_",
                            mlist[m], "_", y))
    fnames = html_nodes(site, "#mw-content-text h3+ ul li")
    text = html_text(fnames)

    temp <- data.frame(text, stringsAsFactors = FALSE)

    data = rbind(data, temp)
  }
}

This is not the best technique for performance reasons: each time through the loop, the memory for the data frame is reallocated, which slows things down. Since this is a one-time run with a limited number of requests, it should be manageable in this case.
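If the loop ever grows to cover many years, a common faster pattern (a sketch, not part of the accepted answer) is to collect each iteration's data frame in a preallocated list and bind once at the end:

```r
# Collect per-month results in a list and rbind a single time, so the big
# data frame is built once instead of being reallocated every iteration.
# The scraping call is stubbed out here for illustration.
results <- vector("list", 12)
for (m in 1:12) {
  text <- paste0("entry for month ", m)   # stand-in for html_text(fnames)
  results[[m]] <- data.frame(text, stringsAsFactors = FALSE)
}
data <- do.call(rbind, results)
```

With real scraping, only the `text <- ...` line changes; the list-then-bind structure stays the same.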
