Deduplicating/collapsing records in an R dataframe
Problem description
I have a dataset made up of various individuals, each with a unique id. An individual can appear multiple times in the dataset, but my understanding is that, apart from one or two variables (there are about 80 per individual), the values should be the same across all entries for the same id.
I want to collapse the data if I can. My main obstacle is certain null values that I need to backfill. I'm looking for a function that can deduplicate the data, something like this:
# Build sample dataset
df1 = data.frame(id=rep(1:6,2)
,classA=rep(c('a','b'),6)
,classB=rep(c(1001:1006),2)
)
df1= df1[order(df1$id),]
df1$classC=c('a',NA,'b',NA,NA,NA,'e','d', NA, 'f', NA, NA)
df1[10,"classB"]=NA
df1=df1[df1$id!=6,]
# sample dataset
> df1
   id classA classB classC
1   1      a   1001      a
7   1      a   1001   <NA>
2   2      b   1002      b
8   2      b   1002   <NA>
3   3      a   1003   <NA>
9   3      a   1003   <NA>
4   4      b   1004      e
10  4      b   1004      d
5   5      a   1005   <NA>
11  5      a     NA      f
# what I'm looking for
> deduplicate(df1, on='id')
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      d
5  4      b   1004      e
6  5      a   1005      f
How about this? (A solution using data.table.)
require(data.table)
DT <- data.table(df1)
# ignore the warning here...
unique(DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id])
   id classA classB classC
1:  1      a   1001      a
2:  2      b   1002      b
3:  3      a   1003     NA
4:  4      b   1004      e
5:  4      b   1004      d
6:  5      a   1005      f
Some explanation:
- The by = id part splits/groups your data.table DT by id.
- .SD is a read-only variable that automatically picks up the split/group for each id, one at a time.
- We therefore split DT by id and, for each split part, use lapply (to take each column) and remove all NAs. Now, if a column in a group holds, say, a, NA, then the NA gets removed and the function returns a. But the input was of length 2 (a, NA), so a is automatically recycled to fit the group size (= 2). Essentially, we replace every NA with a value that already exists in the group. When both values are NA (like NA, NA), NAs are returned (again through recycling).
- If you look at this part: DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id], you should be able to see what has been done. Every NA that could be filled will have been replaced, so all that remains is to pick up the unique rows. That's why the call is wrapped with unique.
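The recycling step described above can be imitated on a single column in base R. This is only a sketch of the idea, not what data.table does internally: drop_na is a made-up helper name, and rep_len() stands in for the recycling that data.table applies to each group's result.

```r
# Per-column step from the answer: drop the NAs from one group's column.
# drop_na is an illustrative name, not a data.table function.
drop_na <- function(x) x[!is.na(x)]

one_group <- c("a", NA)            # e.g. classC for id = 1
kept <- drop_na(one_group)         # a single value, "a"

# Imitate recycling: stretch the kept value back to the group size.
filled <- rep_len(kept, length(one_group))
print(filled)

# A group of three rows with two distinct values keeps two values,
# which no longer fits the group size evenly.
uneven <- drop_na(c("e", "d", NA))
print(length(uneven))
```

The second case is the situation in which simple recycling stops being a safe way to fill the gaps.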
Hope this helps. You'll have to experiment a bit to understand better. I suggest starting here: DT[, print(.SD), by=id]
Final solution:
I just realised that the above solution will not work if, for example, id=4 has another row with classC = NA (and everything else the same). This happens because of the recycling: the non-NA values no longer fit the group size evenly. This code should fix it:
unique(DT[, lapply(.SD, function(x) {x[is.na(x)] <- x[!is.na(x)][1]; x}), by = id])
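As a rough sketch, the final one-liner can be wrapped in a function shaped like the deduplicate(df1, on='id') call from the question. The names deduplicate, fill_first and group_cols are illustrative only and are not part of data.table. The sketch assumes that filling each NA with the first non-NA value in its group is acceptable for your data.

```r
library(data.table)

# Illustrative helper: replace each NA with the group's first non-NA value.
# If the whole group is NA, x[!is.na(x)][1] is NA, so the column is unchanged.
fill_first <- function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
}

# Hypothetical wrapper named after the function sketched in the question.
deduplicate <- function(df, on) {
  DT <- as.data.table(df)
  group_cols <- on  # character vector of grouping column names
  filled <- DT[, lapply(.SD, fill_first), by = group_cols]
  as.data.frame(unique(filled))
}

# Sample data from the question
df1 <- data.frame(id = rep(1:6, 2),
                  classA = rep(c("a", "b"), 6),
                  classB = rep(1001:1006, 2))
df1 <- df1[order(df1$id), ]
df1$classC <- c("a", NA, "b", NA, NA, NA, "e", "d", NA, "f", NA, NA)
df1[10, "classB"] <- NA
df1 <- df1[df1$id != 6, ]

result <- deduplicate(df1, on = "id")
print(result)
```

On the sample data this should leave one row per id, except for id 4, which keeps two rows because its classC values genuinely differ.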