Dcast/merge based on a column, with a value within a certain range

Question

I have a panel dataset, Panel, and a dataset with a list of events, Events. In the panel dataset, observations with the same panelID belong together.

panelID = c(1:50)   
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")

n <- 2

library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
                 country = rep(sample(country, length(panelID), replace = T), each = n),
                 year = c(replicate(length(panelID), sample(year, n))),
                 some_NA = sample(0:5, 6),                                             
                 some_NA_factor = sample(0:5, 6),         
                 norm = round(runif(100)/10,2),
                 Income = round(rnorm(10,-5,5),2),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = sample(100,100),
                 Educ = round(rnorm(10,0.75,0.3),2))        
Panel[, uniqueID := .I]                                                                        # Creates a unique ID     
Panel[Panel == 0] <- NA    

Events <- fread(
"Event_Type  country year   
A   NLD   2005
A   NLD   2004       
A   GBR   2006
A   GBR   2003   
A   GRC   2002             
A   GRC   2007",
header = TRUE)

I want to know how often Events happen between the panel observations, split out per year. For example, for the panel observations with panelID == 2, in the country NLD there are two events in or between the years of those observations, namely in 2004 and 2005. Hence:

Desired output:

panelID country year 2002  2003  2004 2005 2006 2007 
2       NLD     2004 NA    NA    1    1    NA   NA 
2       NLD     2007 NA    NA    1    1    NA   NA 

Based on the solution by Cole, I tried the following:

# cast Event 
Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")

# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]

# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['year']]))

# non-equi update join
Panel[Events_cast,
      on = .(country,
             start <= year,
             end >= year),
      (cols) := mget(cols)]

#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]

Panel

But I get an error at the # non-equi update join step: Error in [.data.table (Panel, Events, on = .(country, : LHS of := appears to be column positions but are outside [1,ncol] range. New columns can only be added by name.

Recommended answer

data.table is trying to use the years to figure out which columns you are selecting. The error is telling you that 2006 and the other years are not valid column numbers. The fix is easy: convert the years to character so they are treated as column names:

cols <- as.character(sort(unique(Events[['year']])))
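
As a minimal sketch of the failure mode (the table dt and the vectors years and year_cols are made up for illustration and are not part of the question's data), a numeric vector on the left-hand side of := is read as column positions, while a character vector is read as column names:

```r
library(data.table)

# Hypothetical one-column table, for illustration only
dt <- data.table(id = 1:3)
years <- c(2004L, 2005L)

# Numeric LHS: 2004 and 2005 are read as column positions, so this errors
res <- tryCatch(
  dt[, (years) := list(0L, 0L)],
  error = function(e) conditionMessage(e)
)
print(res)

# Character LHS: the same values are now read as new column names
year_cols <- as.character(years)
dt[, (year_cols) := list(0L, 0L)]
print(names(dt))
```

Columns named this way are not syntactic R names, so later code has to refer to them with backticks or by string, for example dt[["2004"]].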

And here's everything together, with a few other changes:

  1. Using data.table::dcast instead of reshape2::dcast.
  2. Adding start and end to the Events data.table and casting with those columns.

# cast Event 
# Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")
Events[, `:=`(start = min(year), end = max(year)), by = country]
Events_cast <- dcast(Events, country + start + end ~ year, fun.aggregate = length, value.var = "year")

# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]

# dcast sorts the rhs alphabetically
cols <- as.character(sort(unique(Events[['year']])))

# non-equi update join
# Panel[Events_cast,
#       on = .(country,
#              start <= year,
#              end >= year),
#       (cols) := mget(cols)]

Panel[Events_cast,
      on = .(country,
             start <= start,
             end >= end),
      (cols) := mget(cols)]

#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]

Panel
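
To see what the join condition matches, here is a small sketch on toy data. The two-panel Panel table below is hypothetical; the events are the ones from the question. It also writes mget(paste0("i.", cols)) instead of mget(cols), a variation on the answer that makes explicit that the values come from Events_cast. A panel is updated only when its own start/end range covers the whole start/end range of its country's events.

```r
library(data.table)

# Hypothetical toy panel: two panels, two observations each
Panel <- data.table(panelID = c(2L, 2L, 7L, 7L),
                    country = c("NLD", "NLD", "GBR", "GBR"),
                    year    = c(2004L, 2007L, 2004L, 2005L))

# Events from the question
Events <- data.table(Event_Type = "A",
                     country = c("NLD", "NLD", "GBR", "GBR", "GRC", "GRC"),
                     year    = c(2005L, 2004L, 2006L, 2003L, 2002L, 2007L))

# First and last event year per country, then one count column per year
Events[, `:=`(start = min(year), end = max(year)), by = country]
Events_cast <- dcast(Events, country + start + end ~ year,
                     fun.aggregate = length, value.var = "year")

# Year range covered by each panel
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]

# Column names must be character so := adds them by name
cols <- as.character(sort(unique(Events[["year"]])))

# Update join: the panel's range must contain the country's event range
Panel[Events_cast,
      on = .(country, start <= start, end >= end),
      (cols) := mget(paste0("i.", cols))]

# Panels without a match get 0 instead of NA
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]

print(Panel)
```

In this toy case, panel 2 (2004 to 2007) covers the NLD events (2004 to 2005) and receives their counts, while panel 7 (2004 to 2005) does not cover the GBR events (2003 to 2006) and is filled with 0.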
