匹配数据框,排除最后一个非 NA 值并忽略顺序 [英] Match Dataframes Excluding Last Non-NA Value and disregarding order

查看:13
本文介绍了匹配数据框,排除最后一个非 NA 值并忽略顺序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框:

Partner<-c("Alpha","Beta","Zeta")
COL1<-c("A","C","M")
COL2<-c("B","D","K")
COL3<-c("C","F",NA)
COL4<-c("D",NA,NA)
df1<-data.frame(Partner,COL1,COL2,COL3,COL4)

lift<-c(9,10,11,12,12,23,12,24)
RULE1<-c("B","B","D","A","C","K","M","K")
RULE2<-c("A","A","C","B","A","M","T","M")
RULE3<-c("G","D","M","C" ,"M", "E",NA,NA)
RULE4<-c(NA,NA,"K","D" ,NA, NA,NA,NA)

df2<-data.frame(lift,RULE1,RULE2,RULE3,RULE4)

df1
Partner    COL1    COL2    COL3    COL4
Alpha         A       B       C       D
Beta          C       D       F      NA
Zeta          M       K      NA      NA

df2
lift    RULE1    RULE2     RULE3    RULE4
   9        B        A         G       NA
  10        B        A         D       NA
  11        D        C         M        K
  12        A        B         C        D
  12        C        A         M       NA
  23        K        M         E       NA
  12        M        T        NA       NA
  24        K        M        NA       NA

这是一个购物篮分析.df1 是购买了列出的每个项目的客户/合作伙伴:A、B、C...等.

This is a market basket analysis. df1 is the customer/partner that bought each of the items listed: A, B, C...etc.

df2 是与过去购买的物品相关的推荐.

df2 are the recommendations associated with the items bought in the past.

每个 df2 行中的最后一个值代表建议.因此,从最后一个非 NA 值开始的每一行中的前面值是篮子".

The last value in each of the df2 rows represent the recommendation. So the preceding values in each row from the last non-NA value are the "baskets".

所以比如df2的第一行,就是说:如果B和A一起买,推荐G.

So for example in the first row of df2, it is stating: If B and A are bought together, recommend G.

我希望能够弄清楚 df1 的每个合作伙伴是否购买了每一行中的所有值,不包括最终值,因为这是建议.然后将该建议添加到新数据框每一行的末尾.

I want to be able to figure out if each partner from df1 bought ALL the values in each row excluding the final value since that is the recommendation. Then add that recommendation to the end of each row of the new dataframe.

例如:合伙人:Alpha,从第一行推荐价值G好不好?答案是肯定的,因为他们从 df2(A 和 B)中的该行购买了所有值.

For example: For partner: Alpha, would it be good to recommend value G from the first row? Answer would be yes because they bought all the values from that row in df2 (A and B).

对于合作伙伴:Beta,建议值 G 并不好,因为并非 df2 第一行的所有值都可以在 Beta 行中找到.

For partner: Beta, it would not be good to recommend value G because not all of the values from the first row of df2 are found in the Beta row.

最终输出:

Partner    COL1    COL2    COL3    COL4    lift   RULE1    RULE2    RULE3    RULE4   Does Last Non-NA Value Exist in Row?
Alpha         A       B       C       D       9       B        A        G       NA                                    No
Alpha         A       B       C       D      10       B        A        D       NA                                   Yes
Alpha         A       B       C       D      12       A        B        C        D                                   Yes
Alpha         A       B       C       D      12       C        A        M       NA                                    No
Zeta          M       K      NA      NA      23       K        M        E       NA                                    No
Zeta          M       K      NA      NA      12       M        T       NA       NA                                    No
Zeta          M       K      NA      NA      24       K        M       NA       NA                                   Yes

为清楚起见写出结果:

df3

row1 输出No",因为在 Alpha Partner 中找不到 G,并且 G 之前的所有值都显示在 Alpha Partner (B,A) 中

row1 outputs "No" because G is not found in Alpha Partner and all values before G show up in Alpha Partner (B,A)

row2 输出是",因为在 Alpha 合作伙伴中找到了 D,并且 D 之前的所有值都显示在 Alpha 合作伙伴 (B,A) 中

row2 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (B,A)

第 3 行输出是",因为在 Alpha 合作伙伴中找到了 D,并且 D 之前的所有值都显示在 Alpha 合作伙伴(A、B、C)中

row3 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (A,B,C)

row4 输出No",因为在 Alpha Partner 中找不到 M,并且 M 之前的所有值都显示在 Alpha Partner (C,A) 中

row4 outputs "No" because M is not found in Alpha Partner and all values before M show up in Alpha Partner (C,A)

row5 输出No",因为在 Zeta Partner 中找不到 E,并且 E 之前的所有值都显示在 Zeta Partner (K,M) 中

row5 outputs "No" because E is not found in Zeta Partner and all values before E show up in Zeta Partner (K,M)

row6 输出No",因为在 Zeta Partner (M) 中找不到 T,并且 T 之前的所有值都显示在 Zeta Partner (M) 中

row6 outputs "No" because T is not found in Zeta Partner and all values before T show up in Zeta Partner (M)

row7 输出Yes",因为在 Zeta Partner 中找到了 M,并且 M 之前的所有值都显示在 Zeta Partner (K) 中

row7 outputs "Yes" because M is found in Zeta Partner and all values before M show up in Zeta Partner (K)

我认为这必须是某种连接或匹配,但不知道该怎么做.

I think that has to be a join or a match of some kind but can't figure out how to do it.

如果有人能帮我解决这个问题,这将非常有帮助.

This would be extremely helpful if someone can help me out with this.

谢谢.

这是一次尝试:

df1<-cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
d11  <- df1 %>% gather(Col, Value,starts_with("C"))           #Long
d11 <- d11 %>% na.omit() %>%group_by(df1_id) %>% slice(-n()) #remove last non NA
d22  <- df2 %>%  gather(Rule, Value,starts_with("R"))
res <- inner_join(d11,d22)
rm(d22)
rm(d11)
final<-cbind(df1[res$df1_id,],df2[res$df2_id,])
final$Exist <- apply(final, 1, FUN = function(x) 
c("No", "Yes")[(anyDuplicated(x[!is.na(x) & x != "" ])!=0) +1])

但这不起作用,因为它没有考虑所有值,只有当其中一个匹配时......不是全部.

But this didn't work because it didn't take all of the values into account, only if one of them matched...not all.

推荐答案

这很棘手,因为必须将 n 个客户的购买与一组 m 进行比较规则.除此之外,还有两点增加了复杂性:

This is quite tricky because the purchases of n customers have to be compared to a set of m rules. Besides this, there are two points which add to the complexity:

  1. df2 中最后一个非 NA RULE 列在语义上与其他列不同.不幸的是,给定的数据结构并没有反映这一点.因此,df2 缺少明确的 recommended 列.

  1. The last non-NA RULE column in df2 is semantically different from the others. Unfortunately, the given data structure doesn't reflect this. So, df2 is missing an explicite recommended column.

最后要判断合作伙伴是否已经购买了推荐商品.

Finally, it has to be determined whether a partner already has purchased the recommended item.

出于性能原因,以下方法依赖于 melt()dcast()data.table 包的连接操作.但是,为了避免创建 n * m 行的笛卡尔叉积,使用了循环.

The approach below relies on melt(), dcast() and join operations of the data.table package for performance reasons. However, in order to avoid creation of cartesian crossproduct of n * m rows, a loop is used.

EDIT dcast() 已从 lapply() 函数中移出.

EDIT The dcast() has been moved out of the lapply() function.

library(data.table)
# convert to data.table and add row numbers
# here, a copy is used insteasd of setDT() in order to rename the data.tables
purchases <- as.data.table(df1)[, rnp := seq_len(.N)]
rules <- as.data.table(df2)[, rnr := seq_len(.N)]

# prepare purchases for joins
lp <- melt(purchases, id.vars = c("rnp", "Partner"), na.rm = TRUE)
wp <- dcast(lp, rnp ~ value, drop = FALSE)
wp
#   rnp  A  B  C  D  F  K  M
#1:   1  A  B  C  D NA NA NA
#2:   2 NA NA  C  D  F NA NA
#3:   3 NA NA NA NA NA  K  M


# prepare rules
lr <- melt(rules, id.vars = c("rnr", "lift"), na.rm = TRUE)
# identify last column of each rule which becomes the recommendation
rn_of_last_col <- lr[, last(.I), by = rnr][, V1]
# reshape from long to wide without recommendation
wr <- dcast(lr[-rn_of_last_col], rnr ~ value)
# add column with recommendations (kind of cbind, no join)
wr[, recommended := lr[rn_of_last_col, value]]
wr
#   rnr  A  B  C  D  K  M recommended
#1:   1  A  B NA NA NA NA           G
#2:   2  A  B NA NA NA NA           D
#3:   3 NA NA  C  D NA  M           K
#4:   4  A  B  C NA NA NA           D
#5:   5  A NA  C NA NA NA           M
#6:   6 NA NA NA NA  K  M           E
#7:   7 NA NA NA NA NA  M           T
#8:   8 NA NA NA NA  K NA           M

结合规则和购买

combi <- rbindlist(
  # implied loop over rules to find matching purchases for each rule
  lapply(seq_len(nrow(rules)), function(i) {
    # get col names except last col which is the recommendation
    cols <- lr[rnr == i, value[-.N]]
    # join single rule with all partners on relevant cols for this rule
    wp[wr[i, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
  })
)
# check if recommendation was purchased already
combi[, already_purchased := Reduce(`|`, lapply(.SD, function(x) x == recommended)), 
      .SDcols = -c("rnp", "rnr", "recommended")]
# clean up already purchased
combi[is.na(already_purchased), already_purchased := FALSE
      ][, already_purchased := ifelse(already_purchased, "Yes", "No")]
combi
#   rnp  A  B  C  D  F  K  M rnr recommended already_purchased
#1:   1  A  B  C  D NA NA NA   1           G                No
#2:   1  A  B  C  D NA NA NA   2           D               Yes
#3:   1  A  B  C  D NA NA NA   4           D               Yes
#4:   1  A  B  C  D NA NA NA   5           M                No
#5:   3 NA NA NA NA NA  K  M   6           E                No
#6:   3 NA NA NA NA NA  K  M   7           T                No
#7:   3 NA NA NA NA NA  K  M   8           M               Yes

在创建combi 时,诀窍是仅在每个规则中包含的那些列上加入.这就是为什么需要为每个规则分别进行连接的原因.

In creating combi, the trick is to join only on those columns which are included in each rule. This is why the join needs to be done for each rule separately.

基本上,我们现在完成了.但是,它看起来不像所需的输出.

Essentially, we are done now. However, it doesn't look like the desired output.

tmp_rules <- rules[combi[, .(rnp, rnr, recommended, already_purchased)], on = "rnr"]
tmp_purch <- purchases[combi[, .(rnp, rnr)], on = "rnp"]
result <- tmp_purch[tmp_rules, on = c("rnp", "rnr")]
result[, (c("rnp", "rnr")) := NULL]
result
#   Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 recommend already_purchased
#1:   Alpha    A    B    C    D    9     B     A     G    NA         G                No
#2:   Alpha    A    B    C    D   10     B     A     D    NA         D               Yes
#3:   Alpha    A    B    C    D   12     A     B     C     D         D               Yes
#4:   Alpha    A    B    C    D   12     C     A     M    NA         M                No
#5:    Zeta    M    K   NA   NA   23     K     M     E    NA         E                No
#6:    Zeta    M    K   NA   NA   12     M     T    NA    NA         T                No
#7:    Zeta    M    K   NA   NA   24     K     M    NA    NA         M               Yes

这篇关于匹配数据框,排除最后一个非 NA 值并忽略顺序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆