How to assign identical unique IDs to matching observations between two dataframes in R?


Question


I have two (or more) data frames and want to assign a unique ID to each distinct observation, so that matching observations receive the same ID both within each dataset and across the datasets. For example:

#1. Create dataframe df1:

a1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1)
b1 <- c(1, 5, 3, 2, 3, 4, 5, 1, 5, 2)
c1 <- c("white", "red", "black", "white", "red", 
        "white", "black", "silver", "red", "green")
df1 <- data.frame(a1, b1, c1)
df1

   a1 b1     c1
1   1  1  white
2   1  5    red
3   1  3  black
4   1  2  white
5   2  3    red
6   2  4  white
7   2  5  black
8   2  1 silver
9   1  5    red
10  1  2  green

#2. Create dataframe df2:

a2 <- c(2, 2, 1, 1, 2, 2, 2, 2, 2, 2)
b2 <- c(3, 1, 3, 2, 1, 3, 4, 5, 3, 5)
c2 <- c("black", "blue", "black", "white", "silver", 
        "green", "green", "red", "blue", "white")
df2 <- data.frame(a2, b2, c2)
df2

   a2 b2     c2
1   2  3  black
2   2  1   blue
3   1  3  black
4   1  2  white
5   2  1 silver
6   2  3  green
7   2  4  green
8   2  5    red
9   2  3   blue
10  2  5  white

#3. Assign unique IDs to each observation in df1:

library(data.table)
df1.2 <- data.table(df1, key="a1,b1,c1") 
df1.2[, id:=.GRP, by=key(df1.2)]
df1.2 <- as.data.frame(df1.2)
df1.2

   a1 b1     c1 id
1   1  1  white  1
2   1  2  green  2
3   1  2  white  3
4   1  3  black  4
5   1  5    red  5
6   1  5    red  5
7   2  1 silver  6
8   2  3    red  7
9   2  4  white  8
10  2  5  black  9

#4. The problematic part: give each observation in df2 that matches one in df1.2 the same ID it has in df1.2,
#and give every non-matching observation in df2 a new unique ID.
#Name the resulting dataframe df2.2.
#Ideally, the result would look as follows:

df2.2

   a2 b2     c2 id
1   2  3  black 10 
2   2  1   blue 11
3   1  3  black  4
4   1  2  white  3
5   2  1 silver  6
6   2  3  green 12
7   2  4  green 13
8   2  5    red 14
9   2  3   blue 15
10  2  5  white 16

Any help on how to get to df2.2 would be very much appreciated. Thanks.

Recommended answer

An easy way to approach this is to hash each row's values, so that identical rows get identical IDs in both data frames:

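library(dplyr)
library(digest)

# Hash the pasted column values of each row of df1;
# rows with identical values produce identical MD5 strings.
df1 %>%
  rowwise() %>%
  do( data.frame(., id=digest( paste(.$a1,.$b1,.$c1), algo="md5"),
                 stringsAsFactors=FALSE)) %>% ungroup()

# Same for df2, so rows it shares with df1 get the same id.
df2 %>%
  rowwise() %>%
  do( data.frame(., id=digest( paste(.$a2,.$b2,.$c2), algo="md5"),
                 stringsAsFactors=FALSE)) %>% ungroup()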

This produces the following for df1:

   a1 b1     c1                               id
1   1  1  white b86fbb78b27f7db2ee50af2d68cce452
2   1  5    red 68d47f544832989834517630e4a2764c
3   1  3  black 724e37192140cb2009cf3d982f2be1e4
4   1  2  white f731b8b38255b8c312543283f8e1c634
5   2  3    red 2d50b86902056a51faad04d2c566faf2
6   2  4  white 9396667cd51d1e1b61b0b22a7767d3d9
7   2  5  black 9ba1f3e04c61c006d3c5382fcad098e6
8   2  1 silver 38dcd29d200c8b33cd38ac78ef9dd751
9   1  5    red 68d47f544832989834517630e4a2764c
10  1  2  green 7d9b1aadfd79de142b234b83d7867b9b

and the following for df2:

   a2 b2     c2                               id
1   2  3  black d285febc8ab08e99b11609b98f077e66
2   2  1   blue bfa0405276406ac4bc596daf957dfa11
3   1  3  black 724e37192140cb2009cf3d982f2be1e4
4   1  2  white f731b8b38255b8c312543283f8e1c634
5   2  1 silver 38dcd29d200c8b33cd38ac78ef9dd751
6   2  3  green 67eefe9ee2d82486ded30a268289296b
7   2  4  green d773f58cf144eab15ef459e326494a2f
8   2  5    red 0724318a9f59d3960edfe4e90f9c4eff
9   2  3   blue 6883420cc137ba45b773f642176e9ce6
10  2  5  white 5dea9e63b5fbfb31fb81260cb5a5f41c
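
The hashes are consistent across both data frames, but they are not the small integers shown in the question's expected df2.2. Below is a minimal base-R sketch of one way to get integer IDs instead. It assumes that a row is fully identified by its three columns and that the separator "|" never occurs in the data; the names make_key and lookup are hypothetical and do not come from the original post.

```r
# Data from the question
df1 <- data.frame(
  a1 = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1),
  b1 = c(1, 5, 3, 2, 3, 4, 5, 1, 5, 2),
  c1 = c("white", "red", "black", "white", "red",
         "white", "black", "silver", "red", "green"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  a2 = c(2, 2, 1, 1, 2, 2, 2, 2, 2, 2),
  b2 = c(3, 1, 3, 2, 1, 3, 4, 5, 3, 5),
  c2 = c("black", "blue", "black", "white", "silver",
         "green", "green", "red", "blue", "white"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one composite key string per row.
# Assumption: "|" does not appear in any value.
make_key <- function(df) do.call(paste, c(unname(as.list(df)), sep = "|"))

# Sort df1 by its columns so the IDs follow the same order
# as the keyed data.table in step 3 of the question.
df1.2 <- df1[order(df1$a1, df1$b1, df1$c1), ]
rownames(df1.2) <- NULL
key1 <- make_key(df1.2)

# The position of a key among the distinct keys is its ID.
lookup <- unique(key1)
df1.2$id <- match(key1, lookup)

# Append keys that occur only in df2, in order of first appearance,
# then look up every df2 row in the extended table.
key2 <- make_key(df2)
lookup <- union(lookup, key2)
df2.2 <- df2
df2.2$id <- match(key2, lookup)

df2.2$id
# 10 11  4  3  6 12 13 14 15 16
```

New IDs continue from the highest ID in df1.2 in the order the unmatched rows first appear in df2, which is what the expected output in the question shows. If a value could contain "|", a separator that cannot occur in the data would be needed so that two different rows never produce the same key.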
