R: combining duplicate rows in a time series with different column types in a data.table


Problem description


This question builds on another question, "R combining duplicate rows by ID with different column types in a dataframe". I have a data.table with a column time and several other columns of different types (factors and numerics). Here is an example:

dt <- data.table(time  = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
             abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
             farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "blau")),
             gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2),
             goff  = as.factor(c("haus", "maus", "toll", NA, "haus", NA, "maus", NA, NA, NA, NA, NA, "maus")),
             huft  = as.factor(c(NA, NA, NA, NA, NA, "wolle", NA, NA, "wolle", NA, NA, "holz", NA)),
             mode  = c(4, 2, NA, NA, 6, 5, 0, NA, NA, NA, NA, NA, 3))

Now I want to combine the rows that share the same value in column time. For numeric columns, the result should be the mean of all values with the same time (ignoring NAs!). Factor columns should be combined into one; NAs can be omitted. The expected result is:

dtRes <- data.table(time  = c(1, 1, 1, 2, 3, 4, 4),
                abst  = c(1, 1, 1, 0, 0, 3, 3),
                farbe = as.factor(c("keine", "keine", "keine", "keine", "keine", "rot", "blau")),
                gier  = c(2.5, 2.5, 2.5, 0, 0, 3, 3),
                goff  = as.factor(c("haus", "maus", "toll", "maus", NA, "maus", "maus")),
                huft  = as.factor(c(NA, NA, NA, "wolle", "wolle", "holz", "holz")),
                mode  = c(4, 4, 4, 2.5, NA, 3, 3))

I need some fast calculation for this, because I have about a million observations.

Some extra thoughts on this problem: farbe may not be unique within a time. In that case I think the best option for my data is to have a duplicate row that differs only in farbe, so there are two rows with identical time where everything else stays the same and only the farbe values differ. This should be a very rare case, but handling it would be a great addition.

Also: I have many more numeric and factor columns in my real data, so I don't want to define every single column separately. Some data tables have no factor columns at all. So the solution has to work even if there are no factor columns or no numeric columns other than time (time is always present and numeric).

Thx in advance!

Solution

The expected result (for the given sample dataset) can also be achieved without a subsequent call to separate_rows() or cSplit():

library(data.table) # version 1.12.9
dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
            else unlist(na.omit(unique(x)))), by = time]

   time abst farbe gier goff  huft mode
1:    1    1 keine  2.5 haus  <NA>  4.0
2:    1    1 keine  2.5 maus  <NA>  4.0
3:    1    1 keine  2.5 toll  <NA>  4.0
4:    2    0 keine  0.0 maus wolle  2.5
5:    3    0 keine  0.0 <NA> wolle  NaN
6:    4    3   rot  3.0 maus  holz  3.0
7:    4    3  blau  3.0 maus  holz  3.0

Please note that this approach works for an arbitrary mix of numeric and factor columns; no column names need to be stated explicitly.
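To illustrate the point about arbitrary column mixes, here is a minimal sketch on a table that has no factor columns at all. The table dtNum and its columns a and b are made up for this illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical table with numeric columns only (no factors)
dtNum <- data.table(time = c(1, 1, 2, 2, 2),
                    a    = c(1, 3, NA, 4, 6),
                    b    = c(NA, 2, 5, NA, NA))

# Same aggregation as above; the else branch is simply never reached here
resNum <- dtNum[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE)
                         else unlist(na.omit(unique(x)))), by = time]
resNum  # one row per time; a is 2 and 5, b is 2 and 5
```

Because the type check happens per column inside lapply(), the same call should also cover tables where every non-grouping column is a factor.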


However, I do believe the correct answer to the underlying problem is to return one row per time instead of a kind of partial aggregate (your mileage may vary, of course):

dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]

   time abst    farbe gier           goff  huft mode
1:    1    1    keine  2.5 haus,maus,toll        4.0
2:    2    0    keine  0.0           maus wolle  2.5
3:    3    0    keine  0.0                wolle  NaN
4:    4    3 rot,blau  3.0           maus  holz  3.0
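The NaN in the mode column for time 3 appears because mean(x, na.rm = TRUE) is applied to a group whose values are all NA, whereas the expected result in the question shows NA there. If NA is preferred, one possible sketch is a small wrapper around mean(). The helper name mean_or_na and the reduced table dtSmall are assumptions made for this illustration.

```r
library(data.table)

# Hypothetical helper: mean ignoring NAs, but NA (not NaN) for all-NA groups
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)

# Small stand-in for the sample data: time 3 has no non-NA value in mode
dtSmall <- data.table(time = c(2, 2, 3, 3),
                      mode = c(5, 0, NA, NA))

resSmall <- dtSmall[, lapply(.SD, function(x) if (is.numeric(x)) mean_or_na(x)
                             else list(na.omit(unique(x)))), by = time]
resSmall$mode  # 2.5 NA
```

The helper returns NA_real_ so that the column stays of type double in every group.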

Please note that list() instead of toString() has been used to aggregate the factor columns. This avoids problems in case one of the factor levels happens to include a comma. Furthermore, it makes it easier to identify cases with non-unique factors per time in a large production dataset:

# compute aggregate as before
dtRes <- dt[, lapply(.SD, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) 
                   else list(na.omit(unique(x)))), by = time]
# find cases with non-unique factors per group
# note .SDcols = is.list is available with data.table version 1.12.9
tmp <- dtRes[, which(Reduce(sum, lapply(.SD, function(x) lengths(x) > 1L)) > 0), .SDcols = is.list, by = time]
tmp

   time V1
1:    1  1
2:    4  1

# show affected rows
dtRes[tmp, on = "time"]

   time abst    farbe gier           goff huft mode V1
1:    1    1    keine  2.5 haus,maus,toll         4  1
2:    4    3 rot,blau  3.0           maus holz    3  1

# show unaffected rows
dtRes[!tmp, on = "time"]

   time abst farbe gier goff  huft mode
1:    2    0 keine    0 maus wolle  2.5
2:    3    0 keine    0      wolle  NaN
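Regarding the remark about commas in factor levels, the following sketch contrasts toString() with list() on a made-up table. The table dtComma and its level "a, b" are assumptions chosen only to show the ambiguity.

```r
library(data.table)

# Hypothetical data: one factor level itself contains a comma
dtComma <- data.table(time = c(1, 1, 2),
                      f    = factor(c("a, b", "c", "a")))

# toString() pastes values with ", ", so the boundaries between levels are lost
asString <- dtComma[, .(f = toString(na.omit(unique(f)))), by = time]
asString$f[1]  # "a, b, c": reads like three values although there are two

# list() keeps each group's values as a separate vector
asList <- dtComma[, .(f = list(na.omit(unique(f)))), by = time]
lengths(asList$f)  # 2 1
```

With the list column, lengths() reports the true number of distinct values per group, which is what the check for non-unique factors above relies on.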
