识别一个值的重复项,另一列中的值不同 [英] Identify duplicates of one value with different values in another column

查看:94
本文介绍了识别一个值的重复项,另一列中的值不同的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个ID和地址的数据框.通常,我希望每个重复出现的ID在所有观察中都具有相同的地址,但是我的某些ID具有不同的地址.我想找到在ID上重复但没有至少2个不同地址的那些观测值.然后,我要为其中的一个随机分配一个新ID(以前在DF中不存在的ID).

I have a dataframe of IDs and addresses. Normally, I would expect each recurring ID to have the same address in all observations, but some of my IDs have different addresses. I want to locate those observations that are duplicated on ID, but have at least 2 different addresses. Then, I want to randomize a new ID for one of them (an ID that didn't exist in the DF before).

例如:

ID     Address
1      X
1      X  
1      Y
2      Z
2      Z
3      A
3      B
4      C
4      D
4      E
5      F
5      F
5      F

会返回:

ID    Address
1      X
1      X  
6      Y
2      Z
2      Z
3      A
7      B
4      C
8      D
9      E
5      F
5      F
5      F

所以发生的是第3、7、9和10个观测值获得了新的ID.我将提到一个ID可能有两个以上的不同地址,因此应该为每个唯一地址授予新的ID.

So what happened is the 3rd,7th, 9th and 10th observations got new IDs. I will mention that it is possible for an ID to have even more than 2 different addresses, so the granting of new IDs should happen for each unique address.

我为一个较长的数据帧示例添加了一个代码,其中的rand列应被忽略,但保留在最终输出中.

I added a code for a longer example of a data frame, with rand column that should be ignored but kept in final output.

df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
             Address = c("x","x","y","z","z","a","b","c","d","e",
                         "f","f","f"),
             rand = sample(1:100, 13))

推荐答案

这是具有tidyr和功能nest/unnest

library(tidyr)
library(dplyr)
df %>% group_by(ID,Address) %>% nest %>%
  `[<-`(duplicated(.$ID),"ID",max(.$ID, na.rm = TRUE) + 1:sum(duplicated(.$ID))) %>%
  unnest

# # A tibble: 13 x 3
# ID Address  rand
#    <dbl>  <fctr> <int>
#  1     1       x    58
#  2     1       x     4
#  3     6       y    75
#  4     2       z     5
#  5     2       z    19
#  6     3       a    55
#  7     7       b    34
#  8     4       c    53
#  9     8       d    98
# 10     9       e    97
# 11     5       f    13
# 12     5       f    64
# 13     5       f    80

如果使用magrittr,则如果需要更漂亮的代码(相同的输出),请用inset替换[<-.

If you use magrittr, replace [<- with inset if you want prettier code (same output).

这篇关于识别一个值的重复项,另一列中的值不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆