Add missing rows within combinations of factors
Question
I have a data frame that's maybe best approximated as:
library(data.table)
z <- rep("z",5)
y <- c(rep("st",2),rep("co",2),"fu")
var1 <- c(rep("a",2),rep("b",2),"c")
var2 <- c("y","y","y","z","x")
transp <- c("bus","plane","train","bus","bus")
sample1 <- sample(1:10, 5)
sample2 <- sample(1:10, 5)
# build the table directly; cbind() would coerce the numeric columns to character
df <- data.table(z, y, var1, var2, transp, sample1, sample2)
> df
z y var1 var2 transp sample1 sample2
1: z st a y bus 4 3
2: z st a y plane 10 7
3: z co b y train 8 9
4: z co b z bus 1 5
5: z fu c x bus 6 4
All unique combinations of var1 and var2 already exist in the table. I want to expand the table so that all combinations of var1/var2 include all transp options found in a list:
transtype <- c("bus","train")
Notice "plane" is an option in df but not in transtype. I would like to keep the row that includes transp="plane" but not expand by adding rows with "plane". The columns z and y need to be filled in with the appropriate value and sample1 and sample2 should be NA. Result should be:
> result
z y var1 var2 transp sample1 sample2
1: z st a y bus 4 3
2: z st a y plane 10 7
3: z st a y train NA NA
4: z co b y train 8 9
5: z co b y bus NA NA
6: z co b z bus 1 5
7: z co b z train NA NA
8: z fu c x bus 6 4
9: z fu c x train NA NA
The data.table options I've come up with based on Fastest way to add rows for missing values in a data.frame? and Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors end up expanding all unique combinations of var1 and var2, not just the combinations that already exist in the table. And I don't know how to keep the values of z and y. Like this:
setkey(df, var1, var2, transp)
x <- df[CJ(var1, var2, transp, unique = TRUE)]
Maybe I should be using dplyr? Or maybe I'm missing something simple? I went through the data.table documentation and can't come up with a solution.
Recommended answer
To get only the unique combinations that already exist in df, it might be better to use by to create a new reference data.table and then merge that back with the original one.
Using:
df2 <- df[, .(transp = transtype), by = .(var1,var2)]
merge(df, df2, by = c('var1','var2','transp'), all = TRUE)
gives:
var1 var2 transp z y sample1 sample2
1: a y bus z st 4 3
2: a y plane z st 10 7
3: a y train NA NA NA NA
4: b y bus NA NA NA NA
5: b y train z co 8 9
6: b z bus z co 1 5
7: b z train NA NA NA NA
8: c x bus z fu 6 4
9: c x train NA NA NA NA
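To see why this works, it can help to look at the reference table itself. The following is a minimal sketch using the sample data from the end of this answer and the transtype vector from the question. Because transtype has two elements, j returns two rows for every var1/var2 group, so "plane" never enters df2 and only survives in the final result through the full outer join (all = TRUE).

```r
library(data.table)

# sample data from the answer, transtype from the question
df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")
transtype <- c("bus", "train")

# one row per existing var1/var2 group and per element of transtype
df2 <- df[, .(transp = transtype), by = .(var1, var2)]
print(df2)

# full outer join: keeps every row of df (including "plane") and every row of df2
merged <- merge(df, df2, by = c("var1", "var2", "transp"), all = TRUE)
print(merged)
```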
If you don't want the z and y columns to have NA values, you could do:
df2 <- df[, .(transp = transtype), by = .(var1,var2,z,y)]
merge(df, df2, by = c('var1','var2','transp','z','y'), all = TRUE)
gives:
var1 var2 transp z y sample1 sample2
1: a y bus z st 4 3
2: a y plane z st 10 7
3: a y train z st NA NA
4: b y bus z co NA NA
5: b y train z co 8 9
6: b z bus z co 1 5
7: b z train z co NA NA
8: c x bus z fu 6 4
9: c x train z fu NA NA
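As an alternative sketch (not part of the original answer), the same rows can be added with an anti-join followed by rbindlist. This keeps the column order of the question (z, y, var1, var2, transp, sample1, sample2), whereas merge moves the by columns to the front. The names ref and missing_rows are only illustrative.

```r
library(data.table)

df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")
transtype <- c("bus", "train")

# reference table: every existing z/y/var1/var2 group crossed with transtype
ref <- df[, .(transp = transtype), by = .(z, y, var1, var2)]

# anti-join: rows of ref that have no match in df
missing_rows <- ref[!df, on = .(z, y, var1, var2, transp)]

# append the missing rows; fill = TRUE sets sample1 and sample2 to NA
result <- rbindlist(list(df, missing_rows), use.names = TRUE, fill = TRUE)
setorder(result, var1, var2, transp)
print(result)
```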
NOTE: If the z and y columns have more than one unique value for each var1/var2 combo, it is better to use na.locf from the zoo package to fill the NA values in the z and y columns.
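A minimal sketch of that fill step, assuming the zoo package is installed and starting from the first merge (the one that leaves NA in z and y). Since merge sorts by var1, var2 and transp, an added row can come before the observed row of its group, so this sketch carries values forward and then backward within each group. The helper name fill_both is hypothetical, and which value gets carried depends on the row order inside the group.

```r
library(data.table)
library(zoo)

df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")
transtype <- c("bus", "train")

df2 <- df[, .(transp = transtype), by = .(var1, var2)]
res <- merge(df, df2, by = c("var1", "var2", "transp"), all = TRUE)

# hypothetical helper: last observation carried forward, then backward
fill_both <- function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, fromLast = TRUE, na.rm = FALSE)
}

# fill z and y within each var1/var2 group; sample1 and sample2 stay NA
res[, c("z", "y") := lapply(.SD, fill_both), by = .(var1, var2), .SDcols = c("z", "y")]
print(res)
```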
Used data:
df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")