Subsetting a data frame based on a key spanning several columns in another (summary) data frame

Question

I have a data frame a with 4 identifying columns: A, B, C, D. A second data frame b, created with ddply(), contains a summary of the values across the different Ds for every combination of A, B, C. A third data frame c contains a subset of b: the groups with bad values that I want to delete from a.

Thus, I want a subset of a that omits all the rows whose A, B, C combination is also present in c. I can think of ways to do this (ugly and inefficient) in a loop, but my DBA background encourages me to seek a solution that is a little bit more … direct.

In code:

a <- data.frame(
  A=rep(c('2013-10-30', '2014-11-6'), each=16*20),
  B=rep(1:8, each=2*20),
  C=rep(1:4, each=20),
  D=1:20
)

a$Val=rnorm(nrow(a))

library(plyr)
b <- ddply(a, ~B+C+A, summarise,
           mean_Val=mean(Val))

# Some subset criteria based on AOI group values
c <- subset(b, mean_Val <= 0)

# EDIT: Delete all the rows from a for which the
# key-triplets A,B,C are present in c
for (i in seq_len(nrow(c))) {  # seq_len() skips the loop when c has no rows
  c_row = c[i,]
  a <- a[ which( !(a$A==c_row$A & a$B==c_row$B & a$C==c_row$C) ), ]
}
# This is the loopy type of 'solution' I didn't want to use

Please also feel free to point out anything unclear in my question. I'd be happy to edit if you can point me in the right direction.

Recommended answer

If we have already created the 3 datasets and want to subset the first one ("a") based on the elements of "c"/"c1", one option is anti_join from dplyr. It returns the rows of "a" that have no match in "c1" on the listed key columns (c1 is built in the data section at the end).

library(dplyr)
anti_join(a, c1, by=c('A', 'B', 'C'))
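For a quick check of what the anti-join should return, the same semantics can be sketched by hand in base R with merge(). This is only a stand-in for illustration, not how anti_join works internally; toy_a, toy_bad and the helper column bad are made-up names, not part of the question's data.

```r
# Toy data (hypothetical, for illustration only)
toy_a <- data.frame(
  A   = c("x", "x", "y", "y", "y"),
  B   = c(1, 1, 1, 2, 2),
  C   = c(1, 1, 1, 1, 2),
  Val = c(0.5, -1.2, 0.3, 2.0, -0.7)
)
# Key triplets whose rows should be dropped from toy_a
toy_bad <- data.frame(A = c("x", "y"), B = c(1, 2), C = c(1, 2))

# Mark the bad keys, left-join them onto toy_a,
# then keep only the rows that found no match (marker is NA).
toy_bad$bad <- TRUE
joined <- merge(toy_a, toy_bad, by = c("A", "B", "C"), all.x = TRUE)
kept <- joined[is.na(joined$bad), c("A", "B", "C", "Val")]
print(kept)
```

Unlike anti_join, merge() sorts the result by the key columns, so the row order of the original data frame is not necessarily preserved.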

Update

Or we could use a base R option: use interaction to paste the columns of interest together in both datasets, then check with %in% which key combinations of the first ('a') also occur in the second ('c'). The negated logical index can then be used to subset "a".

 a1 <- a[!(as.character(interaction(a[1:3], sep=".")) %in% 
          as.character(interaction(c[LETTERS[1:3]], sep="."))),]
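To make the intermediate steps visible, here is a small sketch of the same idea on toy data. The names toy_a, toy_bad, key_cols, key_a, key_bad and is_bad are made up for this illustration.

```r
# Toy data (hypothetical, for illustration only)
toy_a <- data.frame(
  A   = c("x", "x", "y", "y"),
  B   = c(1, 1, 1, 2),
  C   = c(1, 1, 1, 2),
  Val = c(0.5, -1.2, 0.3, -0.7)
)
toy_bad <- data.frame(A = "x", B = 1, C = 1)

key_cols <- c("A", "B", "C")

# One string key per row, built from the three key columns
key_a   <- as.character(interaction(toy_a[key_cols], sep = "."))
key_bad <- as.character(interaction(toy_bad[key_cols], sep = "."))

# TRUE for rows of toy_a whose key triplet appears in toy_bad
is_bad <- key_a %in% key_bad

# Keep everything that is not flagged
print(toy_a[!is_bad, ])
```

One caveat with pasted keys: if the key values can themselves contain the separator, two different triplets may collapse to the same string (for example "a.b" + "c" and "a" + "b.c"), so it may be safer to pick a separator that cannot occur in the data.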

Or, as @David Arenburg mentioned, we may not need to create the b or c datasets to get the expected output. Using plyr, create a new mean column ("mean_Val") in "a" with mutate and keep the rows whose group mean is greater than 0 (mean_Val > 0).

 library(plyr)
 subset(ddply(a, ~B+C+A, mutate, mean_Val=mean(Val)), mean_Val>0)

Or using dplyr

 library(dplyr)
  a %>%
     group_by(B, C, A) %>%
     mutate(mean_Val=mean(Val)) %>% 
     filter(mean_Val>0)

Or if we don't need the "mean" values as a column in "a", ave from base R could be used as well.

  a[!!with(a, ave(Val, B, C, A, FUN=function(x) mean(x)>0)),]
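The one-liner above relies on two details of ave: it returns one value per row (the group's result repeated for every member of the group), and it stores a logical result in a numeric vector as 1/0, which is why the double negation is there. A small sketch on toy data, with made-up names toy, group_mean, flag and keep:

```r
# Toy data (hypothetical, for illustration only): one grouping column
toy <- data.frame(
  G   = c("g1", "g1", "g2", "g2", "g2"),
  Val = c(1, 3, -2, -1, 0.5)
)

# Group mean, repeated for each row of the group
group_mean <- ave(toy$Val, toy$G, FUN = mean)

# A logical FUN is written back into a numeric vector, giving 1/0
flag <- ave(toy$Val, toy$G, FUN = function(x) mean(x) > 0)

# !! converts the 1/0 values back to TRUE/FALSE (same as as.logical())
keep <- !!flag

print(toy[keep, ])
```

Here only the rows of group g1 (mean 2) are kept; group g2 has a negative mean and is dropped as a whole.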

If we need to keep the mean column (a variation proposed by @David Arenburg, here named Mean_Val)

  subset(transform(a, Mean_Val = ave(Val, B, C, A, FUN = mean)),
                 Mean_Val > 0)

Data

library(dplyr)  # needed for %>%, group_by and summarise below
set.seed(24)
a <- data.frame(A= sample(LETTERS[1:3], 20, replace=TRUE), 
   B=sample(LETTERS[1:3], 20, replace=TRUE), C=sample(LETTERS[1:3], 
         20, replace=TRUE), D=rnorm(20))

b <- a %>% 
       group_by(A, B, C) %>% 
       summarise(D=sum(D))
set.seed(39)
c1 <- b[sample(1:nrow(b), 6, replace=FALSE),]
