重复的行:根据条件选择行并存储重复的值 [英] Duplicated rows: select rows based on criteria and store duplicated values

查看:124
本文介绍了重复的行:根据条件选择行并存储重复的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理看起来像这样的原始数据集:

I am working on a raw dataset that looks something like this:

df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha", 
                          "Beta","Beta", "Beta","Beta" ),
                 "treatment"= LETTERS[seq(from = 1, to = 8)],
                 "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 
                            1980,1990),
                 "Val" = c(0,0,0,1,0,1,0,1),
                 "Val2" = c(0,2.34,1.3,0,0,2.34,3.2,1.3))

数据有点脏,因为我对每个ID和Year标识符有多个观察结果-例如我在1970年的Alpha中有2个不同的行.在1980年的Beta中也是如此.

The data is a bit dirty as I have multiple observations for each ID and Year identifier - e.g. I have 2 different rows for Alpha in 1970. The same holds for Beta in 1980.

问题在于,感兴趣的变量Val Val2在重复的行中具有不同的分数(以id/year计).

The issue is that the variable of interest Val Val2 have different scores in the duplicated rows (in terms of id/year).

我想找到一种简洁的方法来产生以下最终数据帧:

I would like to find a concise way to produce the following final dataframe:

final <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", 
                             "Beta", "Beta","Beta" ),
                    "treatment"= c("B","C","D","E","G","H"),
                    "Year" = c(1970, 1980, 1990, 1970, 
                               1980,1990),
                    "Val" = c(0,0,1,0,0,1),
                    "Val2" = c(2.34,1.3,0,0,3.2,1.3),
                    "del_treat" = c("A",NA,NA,NA,"F",NA),
                    "del_Val"=c(0,NA,NA,NA,1,NA),
                    "del_Val2"=c(0,NA,NA,NA,2.34,NA))

逻辑如下:

1)我希望每个ID/年只有一个obs

1) I want to have only one obs for every ID/year

2)我只想保留Val2类别中具有较高值的​​观察值.

2) I want only to retain the observation with a higher value in the Val2 category.

3)我想将已删除的行值存储到单独的列中,以跟踪我要删除的del_treatdel_Valdel_Val2.

3) I would like to store the deleted rows values into separate columns to keep track of what I am deleting del_treat, del_Val and del_Val2.

进行说明.在df中,对Alpha/1970有重复的观察.我想将其减少到一行. Val2取值为0和2.34,在最终数据帧中,仅保留2.34.但是,处理A的值在新创建的列del_treatdel_Valdel_Val2中报告.

To illustrate. In df there is a duplicated observation for Alpha/1970. I want to reduce this to a single row. Val2 takes the value of 0 and 2.34, and in the final data frame, only 2.34 is retained. However, the values of treatment A are reported in newly created columns del_treat, del_Val and del_Val2.

我能够根据Val2``setDT(df)[order(-Val2)][,.SD[1,], by = .(ID, Year)]选择行 值,但我想找到一种简洁的方法来将删除的结果也存储"到新列中

I am able to select rows based on the Val2``setDT(df)[order(-Val2)][,.SD[1,], by = .(ID, Year)] value, but I would like to find a concise way to also 'store' the results deleted into the new columns

推荐答案

使用data.table,基于rowid(ID, Year)的dcast在按Val2降序排序后到达您,列名称除外. "_1"列是保持"列,"_ 2"列是"del"列.

Using data.table, a dcast based on rowid(ID, Year) after ordering by Val2 descending gets you there with the exception of column names. The "_1" columns are the "keep" columns, and the "_2" columns are the "del" columns.

library(data.table)
setDT(df)

setorder(df, ID, Year, -Val2)

out <- 
  dcast(df, ID + Year ~ rowid(ID, Year), value.var = c('treatment', 'Val', 'Val2'))
out
#       ID Year treatment_1 treatment_2 Val_1 Val_2 Val2_1 Val2_2
# 1: Alpha 1970           B           A     0     0   2.34   0.00
# 2: Alpha 1980           C        <NA>     0    NA   1.30     NA
# 3: Alpha 1990           D        <NA>     1    NA   0.00     NA
# 4:  Beta 1970           E        <NA>     0    NA   0.00     NA
# 5:  Beta 1980           G           F     0     1   3.20   2.34
# 6:  Beta 1990           H        <NA>     1    NA   1.30     NA

我们可以更改名称以匹配您的名称,唯一的区别是del列的末尾有一个数字.如果每个组的行数大于2,将很有用.

We can change the names to match yours, only difference is the del columns have a number at the end. Would be useful if there is possiblity of > 2 rows per group.

setnames(out, function(x) gsub('(.*)_1', '\\1', x))
setnames(out, function(x) gsub('(.*_\\d+)', 'del_\\1', x))
out
#       ID Year treatment del_treatment_2 Val del_Val_2 Val2 del_Val2_2
# 1: Alpha 1970         B               A   0         0 2.34       0.00
# 2: Alpha 1980         C            <NA>   0        NA 1.30         NA
# 3: Alpha 1990         D            <NA>   1        NA 0.00         NA
# 4:  Beta 1970         E            <NA>   0        NA 0.00         NA
# 5:  Beta 1980         G               F   0         1 3.20       2.34
# 6:  Beta 1990         H            <NA>   1        NA 1.30         NA

这篇关于重复的行:根据条件选择行并存储重复的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆