匹配数据帧排除上次非NA值并忽略顺序 [英] Match Dataframes Excluding Last Non-NA Value and disregarding order

查看:107
本文介绍了匹配数据帧排除上次非NA值并忽略顺序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框:

 合作伙伴< -c(Alpha,Beta,Zeta)
COL1 <-c(A,C,M)
COL2 <-c(B,D,K)
COL3 < C,F,NA)
COL4 <-c(D,NA,NA)
df1 <-data.frame(Partner,COL1,COL2,COL3,COL4)

lift <-c(9,10,11,12,12,23,12,24)
RULE1 <-c(B,B,D,A ,C,K,M,K)
RULE2 <-c(A,A,C,B,A ,T,M)
RULE3 <-c(G,D,M,C,M,E,NA,NA)
RULE4< -c(NA,NA,K,D,NA,NA,NA,NA)

df2< -data.frame(lift,RULE1,RULE2,RULE3,RULE4)

df1
合作伙伴COL1 COL2 COL3 COL4
Alpha ABCD
Beta CDF NA
Zeta MK NA NA

df2
lift RULE1 RULE2 RULE3 RULE4
9 BAG NA
10 BAD NA
11 DCMK
12 ABC D
12 CAM NA
23 KME NA
12 MT NA NA
24 KM NA NA

这是一个市场篮子分析。 df1是购买列出的每个项目的客户/合作伙伴:A,B,C ...等。



df2是与在过去。



每个df2行中的最后一个值代表建议。因此,从最后一个非NA值的每一行中的前一个值是筐。



所以例如在df2的第一行中,它表示:If B和A一起购买,推荐G.



我想知道df1中的每个合作伙伴是否从全部排除了所有值,不包括最终值那就是推荐。然后将该建议添加到新数据帧的每一行的末尾。



例如:
对于合作伙伴:Alpha,从第一行推荐值G是否会很好?答案是肯定的,因为他们在df2(A和B)中购买了该行中的所有值。



对于合作伙伴:测试版,建议值G并不好,因为并不是所有的df2第一行中的值都位于Beta行。



最终输出:

 合作伙伴COL1 COL2 COL3 COL4提升RULE1 RULE2 RULE3 RULE4最后一个非NA值是否存在行? 
Alpha ABCD 9 BAG NA否
Alpha ABCD 10 BAD NA是
Alpha ABCD 12 ABCD是
Alpha ABCD 12 CAM NA否
Zeta MK NA NA 23 KME NA否
Zeta MK NA NA 12 MT NA NA否
Zeta MK NA NA 24 KM NA NA是

为清楚起见,写出结果:



df3



row1 ou输入否,因为在Alpha合作伙伴中没有找到G,而在Alpha合作伙伴(B,A)之前出现所有值。



row2输出是,因为D是在Alpha合作伙伴中发现的所有价值以及Alpha合作伙伴(B,A)之前的所有价值



row3输出是,因为D在Alpha Partner中找到, D出现在Alpha合作伙伴(A,B,C)



row4输出否,因为在Alpha合作伙伴中找不到M,而在Alpha之前的所有值都出现在Alpha合作伙伴(C,A)



row5输出否,因为在Zeta合作伙伴中没有找到E,而在Zeta合作伙伴(K,M) / p>

row6输出否,因为在Zeta合作伙伴中没有找到T,而在Zeta合作伙伴(M)中显示T之前的所有值



row7输出是,因为在Zeta合作伙伴中找到M,而在Zeta合作伙伴(K)中显示M之前的所有值



我认为必须是一个加入或某种比赛,但不能弄清楚如何d



如果有人可以帮助我,这将非常有用。



谢谢。 / p>

这是尝试:

  df1< -cbind(df1_id = 1:nrow(df1),df1)
df2 < - cbind(df2_id = 1:nrow(df2),df2)
d11 <-df1%>%gather(Col,Value,starts_with (C))#Long
d11< - d11%>%na.omit()%>%group_by(df1_id)%>%slice(-n())#remove last non NA
d22< - df2%>%gather(Rule,Value,starts_with(R))
res < - inner_join(d11,d22)
rm(d22)
rm(d11)
final< -cbind(df1 [res $ df1_id,],df2 [res $ df2_id,])
final $ Exist< - apply(final,1,FUN = function (x)
c(否,是)[(anyDuplicated(x [!is.na(x)& x!=])!= 0)+1])$ ​​b $ b

但这没有办法因为它没有考虑到所有的值,只有当其中一个匹配...不是全部。

解决方案

这是非常棘手的,因为购买 n 客户必须与一组 m 规则进行比较。除此之外,还有两点增加了复杂性:


  1. 最后一个非NA df2 中的列在语义上与其他列不同。不幸的是,给定的数据结构并没有反映出来。因此, df2 缺少一个明确的推荐列。


  2. 必须确定合作伙伴是否已经购买了推荐的商品。


以下方法依赖于 melt() dcast()并加入 data.table >>,,,,,,,,。。。。。。。。。。。。。。。。。。。。。。。>>>>>>>>>> 编辑 dcast()已被移出 lapply()函数。



准备n:m加入的数据



 库(数据。表)
#转换为data.table并添加行号
#这里,一个副本用于setDT()的insteasd,以便重命名data.tables
purchase< - as。 data.table(df1)[,rnp:= seq_len(.N)]
rules< - as.data.table(df2)[,rnr:= seq_len(.N)]

#准备连接的购买
lp< - melt(purchases,id.vars = c(rnp,Partner),na.rm = TRUE)
wp< - dcast lp,rnp〜value,drop = FALSE)
wp
#rnp ABCDFKM
#1:1 ABCD NA NA NA
#2:2 NA NA CDF NA NA
#3:3 NA NA NA NA NA NA KM


#准备规则
lr< - melt(rules,id.vars = c(rnr,lift ),na.rm = TRUE)
#识别成为推荐的每个规则的最后一列
rn_of_last_col< - lr [,last(.I),by = rnr] [,V1]
#从长到宽重新形成,没有推荐
wr< - dcast(lr [-rn_of_last_col],rnr〜value)
#添加列与建议(种cbind,不加入)
wr [,推荐:= lr [rn_of_last_col,值]]
wr
#rnr ABCDKM推荐
#1:1 AB NA NA NA NA G
#2:2 AB NA NA NA NA D
#3:3 NA NA CD NA MK
#4:4 ABC NA NA NA D
#5:5 A NA C NA NA NA M
#6:6 NA NA NA NA KME
#7:7 NA NA NA NA NA NA MT
#8:8 NA NA NA NA K NA M



组合规则和购买



  combi<  -  rbindlist(
#隐含循环超过规则,以查找每个规则的匹配购买
lapply(seq_len(nrow(rules)),函数(i){
#获取col名称,除了最后一列,这是推荐
cols < - lr [rnr == i,value [ - 。N]]
#加入单一规则与所有合作伙伴在这个规则的相关cols
wp [wr [i,.SD,.SDcols = c(cols,rnr,推荐 )],on = cols,nomatch = 0]
})

#检查推荐是否已经购买
combi [,already_purchased:= Reduce(`|`,lapply (.SD,function(x)x ==推荐)),
.SDcols = -c(rnp,rnr,推荐)]
#清理已经购买
combi [is.na(already_purchased),already_purchased:= FALSE
] [,已经购买:= ifelse(已经购买,是,否)]
combi
#rnp ABCDFKM rnr推荐已购买
#1:1 ABCD NA NA NA 1 G No
#2:1 ABCD NA NA NA 2 D是
#3:1 ABCD NA NA NA 4 D是
#4:1 ABCD NA NA NA 5 M否
#5:3 NA NA NA NA NA不适用不适用不适用不适用不适用不适用不适用不适用不适用#7:3 NA NA NA NA NA NA KM 8 M是

在创建 combi 时,诀窍是仅加入每个规则中包含的那些列。这就是为什么每个规则必须单独完成连接的原因。



基本上,我们现在完成了。



最终连接



  tmp_rules<  -  rules [combi [,。(rnp,rnr,recommended,already_purchased)],on =rnr] 
tmp_purch< - purchases [combi [,。(rnp,rnr)] ,on =rnp]
result< - tmp_purch [tmp_rules,on = c(rnp,rnr)]
result [,(c(rnp,rnr ):= NULL]
结果
#合作伙伴COL1 COL2 COL3 COL4提升RULE1 RULE2 RULE3 RULE4推荐已购买
#1:Alpha ABCD 9 BAG NA G No
#2:Alpha ABCD 10 BAD NA D是
#3:Alpha ABCD 12 ABCDD是
#4:Alpha ABCD 12 CAM NA M否
#5:Zeta MK NA NA 23 KME NA E否
#6:Zeta MK NA NA 12 MTN A NA T否
#7:Zeta M K NA NA 24 K M NA NA M是


I have two dataframes:

Partner<-c("Alpha","Beta","Zeta")
COL1<-c("A","C","M")
COL2<-c("B","D","K")
COL3<-c("C","F",NA)
COL4<-c("D",NA,NA)
df1<-data.frame(Partner,COL1,COL2,COL3,COL4)

lift<-c(9,10,11,12,12,23,12,24)
RULE1<-c("B","B","D","A","C","K","M","K")
RULE2<-c("A","A","C","B","A","M","T","M")
RULE3<-c("G","D","M","C" ,"M", "E",NA,NA)
RULE4<-c(NA,NA,"K","D" ,NA, NA,NA,NA)

df2<-data.frame(lift,RULE1,RULE2,RULE3,RULE4)

df1
Partner    COL1    COL2    COL3    COL4
Alpha         A       B       C       D
Beta          C       D       F      NA
Zeta          M       K      NA      NA

df2
lift    RULE1    RULE2     RULE3    RULE4
   9        B        A         G       NA
  10        B        A         D       NA
  11        D        C         M        K
  12        A        B         C        D
  12        C        A         M       NA
  23        K        M         E       NA
  12        M        T        NA       NA
  24        K        M        NA       NA

This is a market basket analysis. df1 is the customer/partner that bought each of the items listed: A, B, C...etc.

df2 are the recommendations associated with the items bought in the past.

The last value in each of the df2 rows represent the recommendation. So the preceding values in each row from the last non-NA value are the "baskets".

So for example in the first row of df2, it is stating: If B and A are bought together, recommend G.

I want to be able to figure out if each partner from df1 bought ALL the values in each row excluding the final value since that is the recommendation. Then add that recommendation to the end of each row of the new dataframe.

For example: For partner: Alpha, would it be good to recommend value G from the first row? Answer would be yes because they bought all the values from that row in df2 (A and B).

For partner: Beta, it would not be good to recommend value G because not all of the values from the first row of df2 are found in the Beta row.

Final Output:

Partner    COL1    COL2    COL3    COL4    lift   RULE1    RULE2    RULE3    RULE4   Does Last Non-NA Value Exist in Row?
Alpha         A       B       C       D       9       B        A        G       NA                                    No
Alpha         A       B       C       D      10       B        A        D       NA                                   Yes
Alpha         A       B       C       D      12       A        B        C        D                                   Yes
Alpha         A       B       C       D      12       C        A        M       NA                                    No
Zeta          M       K      NA      NA      23       K        M        E       NA                                    No
Zeta          M       K      NA      NA      12       M        T       NA       NA                                    No
Zeta          M       K      NA      NA      24       K        M       NA       NA                                   Yes

Written out results for clarity:

df3

row1 outputs "No" because G is not found in Alpha Partner and all values before G show up in Alpha Partner (B,A)

row2 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (B,A)

row3 outputs "Yes" because D is found in Alpha Partner and all values before D show up in Alpha Partner (A,B,C)

row4 outputs "No" because M is not found in Alpha Partner and all values before M show up in Alpha Partner (C,A)

row5 outputs "No" because E is not found in Zeta Partner and all values before E show up in Zeta Partner (K,M)

row6 outputs "No" because T is not found in Zeta Partner and all values before T show up in Zeta Partner (M)

row7 outputs "Yes" because M is found in Zeta Partner and all values before M show up in Zeta Partner (K)

I think that has to be a join or a match of some kind but can't figure out how to do it.

This would be extremely helpful if someone can help me out with this.

Thanks.

This was the attempt:

df1<-cbind(df1_id=1:nrow(df1),df1)
df2 <- cbind(df2_id=1:nrow(df2),df2)
d11  <- df1 %>% gather(Col, Value,starts_with("C"))           #Long
d11 <- d11 %>% na.omit() %>%group_by(df1_id) %>% slice(-n()) #remove last non NA
d22  <- df2 %>%  gather(Rule, Value,starts_with("R"))
res <- inner_join(d11,d22)
rm(d22)
rm(d11)
final<-cbind(df1[res$df1_id,],df2[res$df2_id,])
final$Exist <- apply(final, 1, FUN = function(x) 
c("No", "Yes")[(anyDuplicated(x[!is.na(x) & x != "" ])!=0) +1])

But this didn't work because it didn't take all of the values into account, only if one of them matched...not all.

解决方案

This is quite tricky because the purchases of n customers have to be compared to a set of m rules. Besides this, there are two points which add to the complexity:

  1. The last non-NA RULE column in df2 is semantically different from the others. Unfortunately, the given data structure doesn't reflect this. So, df2 is missing an explicite recommended column.

  2. Finally, it has to be determined whether a partner already has purchased the recommended item.

The approach below relies on melt(), dcast() and join operations of the data.table package for performance reasons. However, in order to avoid creation of cartesian crossproduct of n * m rows, a loop is used.

EDIT The dcast() has been moved out of the lapply() function.

Prepare data for n:m join

library(data.table)
# convert to data.table and add row numbers
# here, a copy is used insteasd of setDT() in order to rename the data.tables
purchases <- as.data.table(df1)[, rnp := seq_len(.N)]
rules <- as.data.table(df2)[, rnr := seq_len(.N)]

# prepare purchases for joins
lp <- melt(purchases, id.vars = c("rnp", "Partner"), na.rm = TRUE)
wp <- dcast(lp, rnp ~ value, drop = FALSE)
wp
#   rnp  A  B  C  D  F  K  M
#1:   1  A  B  C  D NA NA NA
#2:   2 NA NA  C  D  F NA NA
#3:   3 NA NA NA NA NA  K  M


# prepare rules
lr <- melt(rules, id.vars = c("rnr", "lift"), na.rm = TRUE)
# identify last column of each rule which becomes the recommendation
rn_of_last_col <- lr[, last(.I), by = rnr][, V1]
# reshape from long to wide without recommendation
wr <- dcast(lr[-rn_of_last_col], rnr ~ value)
# add column with recommendations (kind of cbind, no join)
wr[, recommended := lr[rn_of_last_col, value]]
wr
#   rnr  A  B  C  D  K  M recommended
#1:   1  A  B NA NA NA NA           G
#2:   2  A  B NA NA NA NA           D
#3:   3 NA NA  C  D NA  M           K
#4:   4  A  B  C NA NA NA           D
#5:   5  A NA  C NA NA NA           M
#6:   6 NA NA NA NA  K  M           E
#7:   7 NA NA NA NA NA  M           T
#8:   8 NA NA NA NA  K NA           M

Combine rules and purchases

combi <- rbindlist(
  # implied loop over rules to find matching purchases for each rule
  lapply(seq_len(nrow(rules)), function(i) {
    # get col names except last col which is the recommendation
    cols <- lr[rnr == i, value[-.N]]
    # join single rule with all partners on relevant cols for this rule
    wp[wr[i, .SD, .SDcols = c(cols, "rnr", "recommended")], on = cols, nomatch = 0]
  })
)
# check if recommendation was purchased already
combi[, already_purchased := Reduce(`|`, lapply(.SD, function(x) x == recommended)), 
      .SDcols = -c("rnp", "rnr", "recommended")]
# clean up already purchased
combi[is.na(already_purchased), already_purchased := FALSE
      ][, already_purchased := ifelse(already_purchased, "Yes", "No")]
combi
#   rnp  A  B  C  D  F  K  M rnr recommended already_purchased
#1:   1  A  B  C  D NA NA NA   1           G                No
#2:   1  A  B  C  D NA NA NA   2           D               Yes
#3:   1  A  B  C  D NA NA NA   4           D               Yes
#4:   1  A  B  C  D NA NA NA   5           M                No
#5:   3 NA NA NA NA NA  K  M   6           E                No
#6:   3 NA NA NA NA NA  K  M   7           T                No
#7:   3 NA NA NA NA NA  K  M   8           M               Yes

In creating combi, the trick is to join only on those columns which are included in each rule. This is why the join needs to be done for each rule separately.

Essentially, we are done now. However, it doesn't look like the desired output.

Final joins

tmp_rules <- rules[combi[, .(rnp, rnr, recommended, already_purchased)], on = "rnr"]
tmp_purch <- purchases[combi[, .(rnp, rnr)], on = "rnp"]
result <- tmp_purch[tmp_rules, on = c("rnp", "rnr")]
result[, (c("rnp", "rnr")) := NULL]
result
#   Partner COL1 COL2 COL3 COL4 lift RULE1 RULE2 RULE3 RULE4 recommend already_purchased
#1:   Alpha    A    B    C    D    9     B     A     G    NA         G                No
#2:   Alpha    A    B    C    D   10     B     A     D    NA         D               Yes
#3:   Alpha    A    B    C    D   12     A     B     C     D         D               Yes
#4:   Alpha    A    B    C    D   12     C     A     M    NA         M                No
#5:    Zeta    M    K   NA   NA   23     K     M     E    NA         E                No
#6:    Zeta    M    K   NA   NA   12     M     T    NA    NA         T                No
#7:    Zeta    M    K   NA   NA   24     K     M    NA    NA         M               Yes

这篇关于匹配数据帧排除上次非NA值并忽略顺序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆