我在 data.table 中复制行的方式有效吗? [英] Is my way of duplicating rows in data.table efficient?

查看:11
本文介绍了我在 data.table 中复制行的方式有效吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在一个 data.table 中有月度数据,在另一个 data.table 中有年度数据,现在我想将年度数据与月度中的相应观察值相匹配数据.

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.

我的做法是:复制每个月的年度数据,然后加入月度和年度数据.现在我有一个关于行重复的问题.我知道该怎么做,但我不确定这是否是最好的方法,所以一些意见会很棒.

My approach is as follows: Duplicating the annual data for every month and then join the monthly and annual data. And now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way to do it, so some opinions would be great.

这是我的年度数据以及我目前如何复制的示例性 data.table DT:

Here is an exemplatory data.table DT for my annual data and how I currently duplicate:

library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
                    values = 10:15,
                    startMonth = seq(from=1, by=2, length=6),
                    endMonth = seq(from=3, by=3, length=6))
DT
      ID values startMonth endMonth
[1,] a_1     10          1        3
[2,] a_2     11          3        6
[3,] a_3     12          5        9
[4,] b_1     13          7       12
[5,] b_2     14          9       15
[6,] b_3     15         11       18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT,  ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
[...]

最后一次加入正是我想要的.但是,DT[, list(MONTH=startMonth:endMonth), by="ID"] 已经做了我想做的所有事情,除了将其他列添加到 DT,所以我是想知道我是否可以摆脱代码中的最后三行,即 setkeyjoin 操作.事实证明,您可以,只需执行以下操作:

The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns to DT, so I was wondering if I could get rid of the last three rows in my code, i.e. the setkey and join operations. It turns out, you can, just do the following:

#2. Alternative: More intuitiv and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
 ID MONTH values startMonth endMonth
a_1    1     10          1        3
a_1    2     10          1        3
a_1    3     10          1        3
a_2    3     11          3        6
...

然而,这只是因为我将列名硬编码到 list 表达式中.在我的真实数据中,我事先不知道所有列的名称,所以我想知道是否可以告诉 data.table 返回列 MONTH 我计算如上所示以及 DT 的所有其他列..SD 似乎可以解决问题,但是:

This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:

DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") : 
  maxn (4) is not exact multiple of this j column's length (3)

总而言之,我知道它是如何完成的,但我只是想知道这是否是最好的方法,因为我仍在为 data.table 并且经常在帖子和 wiki 上阅读到做事的方式有好有坏.另外,我不太明白为什么在使用 .SD 时会出错.我认为这是告诉 data.table 你想要所有列的任何简单方法.我错过了什么?

So to summarize, I know how it's been done, but I was just wondering if this is the best way to do it because I'm still struggling a little bit with the syntax of data.table and often read in posts and on the wiki that there are good and bads ways of doing things. Also, I don't quite get why I get an error when using .SD. I thought it is just any easy way to tell data.table that you want all columns. What do I miss?

推荐答案

好问题.你尝试的非常合理.假设您使用的是 v1.7.1,现在可以更轻松地制作 list 列.在这种情况下,它试图在第二组的 MONTH 列(4 项)旁边从 .SD (3 项)中创建一个 list 列.我会把它作为一个错误提出来,谢谢.

Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug , thanks.

与此同时,尝试:

DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
 ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
...

另外,只是为了检查一下您是否看过 roll=TRUE?通常,您只有一个 startMonth 列(不规则有间隙),然后只需 roll 加入它.但是,您的示例数据具有重叠的月份范围,因此使其复杂化.

Also, just to check you've seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then just roll join to it. Your example data has overlapping month ranges though, so that complicates it.

这篇关于我在 data.table 中复制行的方式有效吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆