Filling missing dates by group

Question

I have a data set that looks like this:

shop_id,item_id,time,value
150,1,2015-07-10,3
150,1,2015-07-11,5
150,1,2015-07-13,2
150,2,2015-07-10,15
150,2,2015-07-12,12

I want to expand this irregular time series into a regular one, with consecutive dates within each group:

shop_id,item_id,time,value
150,1,2015-07-10,3
150,1,2015-07-11,5
150,1,2015-07-12,0 # <~~ added
150,1,2015-07-13,2
150,2,2015-07-10,15
150,2,2015-07-11,0 # <~~ added
150,2,2015-07-12,12

For the dates that are added, the corresponding values should be zero. I've read very similar questions (using either R or SQL coalescing), but most of the solutions I've seen don't involve GROUP BY.

I have access to the SQL database, and I can also export the data as CSV for manipulation, preferably in C#. I was hoping to find a C# library that can do this kind of data manipulation but couldn't find any.

Any advice or help is appreciated!

Recommended answer

You can use data.table from R, assuming the 'time' column is of class 'Date':

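library(data.table)  # v1.9.5+
# Build the full daily sequence of dates for each (shop_id, item_id) group
DT1 <- setDT(df1)[, list(time = seq(min(time), max(time), by = 'day')),
                    by = .(shop_id, item_id)]
# Key df1 on its first three columns, join to DT1, and zero-fill the added rows
setkeyv(df1, names(df1)[1:3])[DT1][is.na(value), value := 0]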
#   shop_id item_id       time value
#1:     150       1 2015-07-10     3
#2:     150       1 2015-07-11     5
#3:     150       1 2015-07-12     0
#4:     150       1 2015-07-13     2
#5:     150       2 2015-07-10    15
#6:     150       2 2015-07-11     0
#7:     150       2 2015-07-12    12

In the devel version, you can also do this without setting the key. Instructions to install the devel version are here.

 df1[DT1, on =c('shop_id', 'item_id', 'time')][is.na(value), value:=0]
 #   shop_id item_id       time value
 #1:     150       1 2015-07-10     3
 #2:     150       1 2015-07-11     5
 #3:     150       1 2015-07-12     0
 #4:     150       1 2015-07-13     2
 #5:     150       2 2015-07-10    15
 #6:     150       2 2015-07-11     0
 #7:     150       2 2015-07-12    12

Or as @Arun suggested, a more efficient option would be

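 # Set value to 0 on every row of the full date grid, then overwrite it from df1 where a row matches
 DT1[, value := 0L][df1, value := i.value, on = c('shop_id', 'item_id', 'time')]
 DT1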
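
To try this end to end, here is a minimal self-contained sketch that rebuilds the sample rows from the question and applies the same update join. The name `grid` and the hard-coded rows are illustrative assumptions, not part of the original answer, and the sketch assumes a data.table version that supports `on=` (1.9.6 or later).

```r
library(data.table)

# Sample rows from the question, with 'time' stored as Date (assumed input)
df1 <- data.table(
  shop_id = 150L,
  item_id = c(1L, 1L, 1L, 2L, 2L),
  time    = as.Date(c("2015-07-10", "2015-07-11", "2015-07-13",
                      "2015-07-10", "2015-07-12")),
  value   = c(3L, 5L, 2L, 15L, 12L)
)

# Full daily grid per (shop_id, item_id), from each group's first to last date
grid <- df1[, .(time = seq(min(time), max(time), by = "day")),
            by = .(shop_id, item_id)]

# Start every grid row at 0, then copy the observed values over with an update join
grid[, value := 0L]
grid[df1, value := i.value, on = c("shop_id", "item_id", "time")]

print(grid)
```

Each group is expanded only between its own first and last date, so item 1 runs from 2015-07-10 to 2015-07-13 while item 2 stops at 2015-07-12. If every group should instead span one common date range, the `min(time)` and `max(time)` calls would need to be computed over the whole table rather than per group.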
