Deduplicating/collapsing records in an R dataframe


Problem description


I have a dataset that is comprised of various individuals, where each individual has a unique id. Each individual can appear multiple times in the dataset, but it's my understanding that besides differing in one or two variables (there are about 80 for each individual) the values should be the same for each entry for the same user id in the dataset.

I want to try to collapse the data if I can. My main obstacle is certain null values that I need to back populate. I'm looking for a function that can accomplish deduplication looking something like this:

# Build sample dataset
df1 = data.frame(id=rep(1:6,2)                 
                ,classA=rep(c('a','b'),6)
                ,classB=rep(c(1001:1006),2)
                )
df1= df1[order(df1$id),]
df1$classC=c('a',NA,'b',NA,NA,NA,'e','d', NA, 'f', NA, NA)
df1[10,"classB"]=NA
df1=df1[df1$id!=6,]

#sample dataset
> df1
   id classA classB classC
1   1      a   1001      a
7   1      a   1001   <NA>
2   2      b   1002      b
8   2      b   1002   <NA>
3   3      a   1003   <NA>
9   3      a   1003   <NA>
4   4      b   1004      e
10  4      b   1004      d
5   5      a   1005   <NA>
11  5      a     NA      f        

# what I'm looking for
> deduplicate(df1, on='id')
  id classA classB classC
1  1      a   1001      a
2  2      b   1002      b
3  3      a   1003   <NA>
4  4      b   1004      d
5  4      b   1004      e
6  5      a   1005      f     

Solution

How about this? (solution using data.table)

require(data.table)
DT <- data.table(df1)
# ignore the warning here...
unique(DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id])

   id classA classB classC
1:  1      a   1001      a
2:  2      b   1002      b
3:  3      a   1003     NA
4:  4      b   1004      e
5:  4      b   1004      d
6:  5      a   1005      f

Some explanation:

  • The by = id part splits (groups) your data.table DT by id.
  • .SD is a read-only variable that holds the rows of each group, one id at a time.
  • So DT is split by id, and for each group lapply takes each column in turn and drops all NAs. If a column holds, say, a, NA, the NA is removed and a is returned. But the input had length 2 (a, NA), so a is automatically recycled to fit the group size (= 2). In effect, every NA is replaced with a value that already exists in the group. When both values are NA (as in NA, NA), NAs are returned (again through recycling).
  • If you look at the part DT[, lapply(.SD, function(x) x[!is.na(x)]), by = id], you should be able to see what has happened: every NA that could be filled has been replaced. All that remains is to pick up the unique rows, which is why it's wrapped in unique.
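
To make the column-level step concrete, here is a minimal base-R sketch of what the anonymous function does to a single column, outside of data.table. The name drop_na is made up for illustration, and rep_len is only a stand-in that mimics the recycling described above; it is not what data.table calls internally.

```r
# Hypothetical helper: same body as the anonymous function in the answer.
drop_na <- function(x) x[!is.na(x)]

one_value  <- drop_na(c("a", NA))                       # like classC for id = 1
two_values <- drop_na(c("e", "d"))                      # like classC for id = 4
no_value   <- drop_na(c(NA_character_, NA_character_))  # like classC for id = 3

length(one_value)   # 1: shorter than the 2-row group, so it has to be recycled
length(two_values)  # 2: already matches the group size
length(no_value)    # 0: nothing left to recycle

# Stand-in for the recycling step: stretch the single value back to 2 rows.
refilled <- rep_len(one_value, 2)
refilled            # "a" "a"
```

After this step the two rows for id = 1 are identical, which is what lets unique collapse them into one.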

Hope this helps. You'll have to experiment a bit to understand better. I suggest starting here: DT[, print(.SD), by=id]
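
Along the same lines, a small sketch for poking at the groups, assuming the sample data from the question (rebuilt here so the snippet runs on its own). The summary column names n_rows, n_cols and missing_classC are arbitrary choices for this example.

```r
library(data.table)

# Sample data from the question, written out row by row.
DT <- data.table(
  id     = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  classA = c("a", "a", "b", "b", "a", "a", "b", "b", "a", "a"),
  classB = c(1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004, 1005, NA),
  classC = c("a", NA, "b", NA, NA, NA, "e", "d", NA, "f")
)

# One summary row per id: .N is the number of rows in the group, and .SD
# holds the group's columns without the grouping column itself.
group_summary <- DT[, .(n_rows         = .N,
                        n_cols         = ncol(.SD),
                        missing_classC = sum(is.na(classC))),
                    by = id]
print(group_summary)
```

Here n_cols comes out as 3 rather than 4 because id is the grouping column and is left out of .SD.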


Final solution:

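I just realised that the above solution will not work if, for example, id = 4 has another row with classC = NA (and everything else the same). This is due to a recycling issue: the two non-NA values no longer match the three rows in the group. This code should fix it by filling each NA with the first non-NA value of its column within the group instead: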

unique(DT[, lapply(.SD, function(x) {x[is.na(x)] <- x[!is.na(x)][1]; x}), by = id])
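
If you want something shaped like the deduplicate(df1, on='id') call from the question, the final one-liner could be wrapped roughly as below. This is only a sketch: deduplicate and fill_first are names invented here (neither is part of data.table), and it assumes on holds the name of a column in df.

```r
library(data.table)

# Hypothetical wrapper around the final solution; not a data.table function.
deduplicate <- function(df, on) {
  # Fill every NA in a column with the group's first non-NA value.
  # If the whole group is NA, the column stays NA.
  fill_first <- function(x) {
    x[is.na(x)] <- x[!is.na(x)][1]
    x
  }
  key_cols <- on  # character vector of grouping column names
  dt <- as.data.table(df)
  unique(dt[, lapply(.SD, fill_first), by = key_cols])
}

# Sample data from the question.
df1 <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  classA = c("a", "a", "b", "b", "a", "a", "b", "b", "a", "a"),
  classB = c(1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004, 1005, NA),
  classC = c("a", NA, "b", NA, NA, NA, "e", "d", NA, "f"),
  stringsAsFactors = FALSE
)
res <- deduplicate(df1, on = "id")
print(res)

# The problem case: a third row for id = 4 with classC = NA.
extra <- data.frame(id = 4, classA = "b", classB = 1004,
                    classC = NA_character_, stringsAsFactors = FALSE)
res_extra <- deduplicate(rbind(df1, extra), on = "id")
print(res_extra)
```

Both calls should give the same six rows. Note that the two id = 4 rows come back in their original order (e, then d), not sorted as in the question's expected output.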
