识别一个值的重复项,另一列中的值不同 [英] Identify duplicates of one value with different values in another column
问题描述
我有一个ID和地址的数据框.通常,我希望每个重复出现的ID在所有观察中都具有相同的地址,但是我的某些ID具有不同的地址.我想找到在ID上重复但没有至少2个不同地址的那些观测值.然后,我要为其中的一个随机分配一个新ID(以前在DF中不存在的ID).
I have a dataframe of IDs and addresses. Normally, I would expect each recurring ID to have the same address in all observations, but some of my IDs have different addresses. I want to locate those observations that are duplicated on ID, but have at least 2 different addresses. Then, I want to randomize a new ID for one of them (an ID that didn't exist in the DF before).
例如:
ID Address
1 X
1 X
1 Y
2 Z
2 Z
3 A
3 B
4 C
4 D
4 E
5 F
5 F
5 F
会返回:
ID Address
1 X
1 X
6 Y
2 Z
2 Z
3 A
7 B
4 C
8 D
9 E
5 F
5 F
5 F
所以发生的是第3、7、9和10个观测值获得了新的ID.我将提到一个ID可能有两个以上的不同地址,因此应该为每个唯一地址授予新的ID.
So what happened is the 3rd,7th, 9th and 10th observations got new IDs. I will mention that it is possible for an ID to have even more than 2 different addresses, so the granting of new IDs should happen for each unique address.
我为一个较长的数据帧示例添加了一个代码,其中的rand列应被忽略,但保留在最终输出中.
I added a code for a longer example of a data frame, with rand column that should be ignored but kept in final output.
df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
Address = c("x","x","y","z","z","a","b","c","d","e",
"f","f","f"),
rand = sample(1:100, 13))
推荐答案
这是具有tidyr
和功能nest
/unnest
library(tidyr)
library(dplyr)
df %>% group_by(ID,Address) %>% nest %>%
`[<-`(duplicated(.$ID),"ID",max(.$ID, na.rm = TRUE) + 1:sum(duplicated(.$ID))) %>%
unnest
# # A tibble: 13 x 3
# ID Address rand
# <dbl> <fctr> <int>
# 1 1 x 58
# 2 1 x 4
# 3 6 y 75
# 4 2 z 5
# 5 2 z 19
# 6 3 a 55
# 7 7 b 34
# 8 4 c 53
# 9 8 d 98
# 10 9 e 97
# 11 5 f 13
# 12 5 f 64
# 13 5 f 80
如果使用magrittr
,则如果需要更漂亮的代码(相同的输出),请用inset
替换[<-
.
If you use magrittr
, replace [<-
with inset
if you want prettier code (same output).
这篇关于识别一个值的重复项,另一列中的值不同的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!