Seeing if all values in one dataframe row exist in another dataframe


Problem description

I have two data frames as follows:

df1    

ColA     ColB     ColC     ColD
  10        A        B        L
  11        N        Q       NA
  12        P        J        L
  43        M        T       NA
  89        O        J        T

df2

ATTR      Att      R1   R2    R3    R4
   1       45       A    B    NA    NA
   2       40       C    D    NA    NA
   3       33       T    J     O    NA
   4       65       L   NA    NA    NA
   5       20       P    L     J    NA
   6       23       Q   NA    NA    NA
   7       38       Q    L    NA    NA

How do I match df2 against df1 so that a df2 row is joined to a df1 row only when ALL of the non-NA values in that df2 row (regardless of order) appear in the df1 row? That is, every value in the df2 row must be present, not just one of them. The final result in this case should be:

ColA     ColB     ColC     ColD   ATTR      Att      R1   R2    R3    R4
  10        A        B        L      1       45       A    B    NA    NA
  10        A        B        L      4       65       L   NA    NA    NA
  11        N        Q       NA      6       23       Q   NA    NA    NA
  12        P        J        L      4       65       L   NA    NA    NA
  12        P        J        L      5       20       P    L     J    NA    
  89        O        J        T      3       33       T    J     O    NA

Thanks.

Recommended answer

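Here is a possible solution using base R. Note that the code below names the data frames differently from the question: the question's df1 is called df here, and the question's df2 is called df1 (see the data at the end).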

Make sure all the value columns are character before continuing, i.e.

df[-1] <- lapply(df[-1], as.character)
df1[-c(1:2)] <- lapply(df1[-c(1:2)], as.character)
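
A quick way to confirm that the conversion took effect is to inspect the column classes afterwards. The sketch below uses a small hypothetical data frame named toy (not part of the original answer) whose value columns start out as factors.

```r
# Hypothetical toy data: an integer key column plus value columns stored as factors.
toy <- data.frame(ColA = c(10L, 11L),
                  ColB = factor(c("A", "N")),
                  ColC = factor(c("B", "Q")))

# Convert every column except the first to character, as in the answer.
toy[-1] <- lapply(toy[-1], as.character)

# TRUE for each value column that is now character.
all_char <- vapply(toy[-1], is.character, logical(1))
print(all_char)
```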

First we create two lists that hold, for each data frame, the vector of non-NA elements of each row. We then build a matrix m1 whose entry for a pair of rows is the number of elements of the l2 vector that are not found in the l1 vector. A count of 0 means every value in that df1 row is present in that df row, i.e. the two rows match:

l1 <- lapply(split(df[-1], seq(nrow(df))), function(i) i[!is.na(i)])
l2 <- lapply(split(df1[-c(1:2)], seq(nrow(df1))), function(i) i[!is.na(i)])

m1 <- sapply(l1, function(i) sapply(l2, function(j) length(setdiff(j, i))))
m1
#  1 2 3 4 5
#1 0 2 2 2 2
#2 2 2 2 2 2
#3 3 3 2 2 0
#4 0 1 0 1 1
#5 2 3 0 3 2
#6 1 0 1 1 1
#7 1 1 1 2 2
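
To see what a single entry of m1 means, the comparison can be run on one pair of rows by hand. This is only an illustrative sketch: the vectors are typed in from the example data, and the names row_df, row_df1 and row_bad are made up for the illustration.

```r
# Non-NA values from single rows of the example data.
row_df  <- c("A", "B", "L")   # row 1 of df
row_df1 <- c("A", "B")        # row 1 of df1 (ATTR 1)
row_bad <- c("Q", "L")        # row 7 of df1 (ATTR 7)

# setdiff(x, y) returns the elements of x that are not in y,
# so its length counts the values that are missing from the df row.
n_missing_ok  <- length(setdiff(row_df1, row_df))  # 0: both values are found
n_missing_bad <- length(setdiff(row_bad, row_df))  # 1: "Q" is not found

print(n_missing_ok)
print(n_missing_bad)
```

These two counts correspond to the entries in the first column of m1 for rows 1 and 7.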

We then use that matrix to create a couple of columns in our original df. The first column, rpt, counts how many zeros each column of m1 contains, i.e. how many df1 rows match each df row, and serves as the number of times to repeat that row. We also use it to drop the rows where rpt is 0 (the rows that have no match in df1). After expanding the data frame we create another variable, ATTR (same name as ATTR in df1), to use as the merge key, i.e.

df$rpt <- colSums(m1 == 0)
df <- df[df$rpt != 0,]
df <- df[rep(row.names(df), df$rpt),]
df$ATTR <- which(m1 == 0, arr.ind = TRUE)[,1]
df
#    ColA ColB ColC ColD rpt ATTR
#1     10    A    B    L   2    1
#1.1   10    A    B    L   2    4
#2     11    N    Q <NA>   1    6
#3     12    P    J    L   2    4
#3.1   12    P    J    L   2    5
#5     89    O    J    T   1    3
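
The expansion step relies on which(..., arr.ind = TRUE) listing the zero positions column by column, so that the row indices line up with the repeated rows. A minimal sketch on a small made-up matrix m (hypothetical values, not the m1 above) shows the ordering:

```r
# Hypothetical mismatch counts: rows stand for df1 rows, columns for df rows.
m <- matrix(c(0, 2,
              1, 0,
              0, 0), nrow = 3, byrow = TRUE)

# Number of zero entries per column, i.e. matches per df row.
rpt <- colSums(m == 0)

# Positions of the zeros, listed in column-major order:
# all hits for column 1 first, then all hits for column 2.
hits <- which(m == 0, arr.ind = TRUE)

print(rpt)
print(hits)
```

In the answer the row index doubles as the merge key because ATTR in df1 happens to equal the row number. If that were not the case, one option would be to look the key up explicitly, for example with df1$ATTR[hits[, "row"]].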

We then merge the two data frames and order the result,

final_df <- merge(df, df1, by = 'ATTR')

final_df[order(final_df$ColA),]
#  ATTR ColA ColB ColC ColD rpt Att R1   R2   R3   R4
#1    1   10    A    B    L   2  45  A    B <NA> <NA>
#3    4   10    A    B    L   2  65  L <NA> <NA> <NA>
#6    6   11    N    Q <NA>   1  23  Q <NA> <NA> <NA>
#4    4   12    P    J    L   2  65  L <NA> <NA> <NA>
#5    5   12    P    J    L   2  20  P    L    J <NA>
#2    3   89    O    J    T   1  33  T    J    O <NA>
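
As a cross-check, the same pairs can be found with a different technique from the one used above: testing every (df row, df1 row) pair directly with %in%. This is an alternative sketch, not the answer's method; the helper is_subset and the objects pairs, keep and result are names introduced here for illustration, and the data is retyped from the dput output below.

```r
df <- data.frame(ColA = c(10L, 11L, 12L, 43L, 89L),
                 ColB = c("A", "N", "P", "M", "O"),
                 ColC = c("B", "Q", "J", "T", "J"),
                 ColD = c("L", NA, "L", NA, "T"),
                 stringsAsFactors = FALSE)
df1 <- data.frame(ATTR = 1:7,
                  Att = c(45L, 40L, 33L, 65L, 20L, 23L, 38L),
                  R1 = c("A", "C", "T", "L", "P", "Q", "Q"),
                  R2 = c("B", "D", "J", NA, "L", NA, "L"),
                  R3 = c(NA, NA, "O", NA, "J", NA, NA),
                  R4 = NA_character_,
                  stringsAsFactors = FALSE)

# Every combination of a df row index (i) and a df1 row index (j).
pairs <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df1)))

# TRUE when every non-NA value of df1 row j appears in df row i.
# Caveat: a df1 row with only NA values would count as a match for every df row.
is_subset <- function(i, j) {
  need <- unlist(df1[j, c("R1", "R2", "R3", "R4")], use.names = FALSE)
  have <- unlist(df[i, c("ColB", "ColC", "ColD")], use.names = FALSE)
  all(need[!is.na(need)] %in% have)
}
keep <- mapply(is_subset, pairs$i, pairs$j)

# Bind the matching rows side by side, df columns first, then sort.
result <- cbind(df[pairs$i[keep], ], df1[pairs$j[keep], ])
result <- result[order(result$ColA, result$ATTR), ]
rownames(result) <- NULL
print(result)
```

This version puts the df columns first and has no helper rpt column, which is closer to the layout asked for in the question. It checks all row pairs one at a time, so it is likely to be slow on large inputs.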

Data

dput(df)
structure(list(ColA = c(10L, 11L, 12L, 43L, 89L), ColB = c("A", 
"N", "P", "M", "O"), ColC = c("B", "Q", "J", "T", "J"), ColD = c("L", 
NA, "L", NA, "T")), .Names = c("ColA", "ColB", "ColC", "ColD"
), row.names = c(NA, -5L), class = "data.frame")

dput(df1)
structure(list(ATTR = 1:7, Att = c(45L, 40L, 33L, 65L, 20L, 23L, 
38L), R1 = c("A", "C", "T", "L", "P", "Q", "Q"), R2 = c("B", 
"D", "J", NA, "L", NA, "L"), R3 = c(NA, NA, "O", NA, "J", NA, 
NA), R4 = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_)), .Names = c("ATTR", 
"Att", "R1", "R2", "R3", "R4"), row.names = c(NA, -7L), class = "data.frame")
