使用dplyr将一个data.frame的输出管道输出到另一个 [英] Pipe output of one data.frame to another using dplyr

查看:129
本文介绍了使用dplyr将一个data.frame的输出管道输出到另一个的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个data.frames - 一个查询表,告诉我一个组中包含的集合产品。每个组至少具有<1>类型1和类型2的一个产品。



第二个data.frame告诉我有关事务的详细信息。每个交易都可以具有以下产品之一:



a)只有来自其中一个组的类型1的 s



b)只有来自其中一个组的类型2的产品 s



c)类型的产品1和类型2 从同一组



对于我的分析,我有兴趣发现c)以上,即有多少交易有类型1 类型2(来自同一组的)出售。如果产品类型1的产品和来自不同组的类型2的产品在同一交易中出售,我们将完全忽略该交易。



因此,类型1或类型2的每个产品必须属于同一组。



这是我的查找表:

 > P_Lookup 
Group ProductID1 ProductID2
Group1 A 1
Group1 B 2
Group1 B 3
Group2 C 4
Group2 C 5
Group2 C 6
Group3 D 7
Group3 C 8
Group3 C 9
Group4 E 10
Group4 F 11
Group4 G 12
Group5 H 13
Group5 H 14
Group5 H 15

例如,我不会产品G和产品15在一个交易中,因为它们属于不同的组。



以下是交易:

  TransactionID ProductID ProductType 
a1 A 1
a1 B 1
a1 1 2
a2 C 1
a2 4 2
a2 5 2
a3 D 1
a3 C 1
a3 7 2
a3 8 2
a4 H 1
a5 1 2
a5 2 2
a5 3 2
a5 3 2
a5 1 2
a6 H 1
a6 15 2

我的代码



现在,我可以使用 dplyr 用于从一个组中更新列表交易。但是,我不知道我如何向所有组的代码向量化。



这是我的代码:

  P_Groups< -unique(P_Lookup $ Group)
Chosen_Group< -P_Groups [5]

P_Group_Ind < - P_Trans% >%
group_by(TransactionID)%>%
dplyr :: filter((ProductID%in%unique(P_Lookup [P_Lookup $ Group == Chosen_Group,] $ ProductID1))|
(ProductID%in%unique(P_Lookup [P_Lookup $ Group == Chosen_Group,] $ ProductID2)))%>%
mutate(No_of_PIDs = n_distinct(ProductType))%>%
mutate(Group_Name = Chosen_Group)

P_Group_Ind< -P_Group_Ind [P_Group_Ind $ No_of_PIDs> 1,]

只要手动选择每个组,即通过设置 Chosen_Group ,这样做很好。但是,我不知道如何自动化这个。一种方式,我在想的是使用循环,但是我知道R的美丽是矢量化的,所以我想远离使用for循环。



我真诚地感谢任何帮助。我花了差不多两天的时间。我查看了在r 中使用dplyr for循环,但似乎这个线程正在谈论一个不同的问题。






DATA:
这里是 dput for P_Trans

  structure(list(TransactionID = c(a1,a1,a1,a2,a2,
a2,a3,a3,a3 ,a3,a4,a5,a5,a5,a5,a5,
a6,a6),ProductID = B,1,C,4,5,D,
C,7,8,H,1 ,3,3,1,H,15),ProductType = c(1,
1,2,1,2,2,1,1,2,2 ,1,2,2,2,2,1,2)),.Names = c(TransactionID,
ProductID,ProductType),row.names = c(NA,18L ),class =data.frame)

这是 dput for P_Lookup

  structure(list(Group = c(Group1,Group1,Group1,Group2,
Group2,Group2,Group3 ,Group3,Group3,Group4,Group4,
Group4,Group5,Group5,Group5),ProductID1 = c(A,
B,B,C,C,C,D,C,C,E,F,G,H ,
H),ProductID2 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,
14,15)) ,.Names = c(Group,ProductID1,ProductID2),row.names = c(NA,
15L),class =data.frame)






这里是 dput()将查询表中不存在的产品添加到P_Trans中:

 结构(list(TransactionID = c a1,a1,a1,a2,a2,
a2,a3,a3,a3,a3,a4 a5,a5,a5,a5,
a6,a6,a7),ProductID = c(A,B ,C,4,5,
D,C,7,8,H,1,2,3 3,1,H,15,
22),ProductType = c(1,1,2,1,2,2,1,1,2,2,1, 2,2,
2,2,2,1,2,3)),.Names = c(TransactionID,ProductID,
ProductType),row.names = c NA, 19L),class =data.frame)


解决方案

以下是我希望能够帮助的一个整洁(dplyr,tidyr和purrr)解决方案。



请注意,使用 map_df 将所有结果作为数据框返回。如果您希望将其作为每个组的列表对象,则只需使用映射

 

$保存唯一的组以供以后使用
P_Groups< -
库(tidyr)
库唯一(P_Lookup $ Group)

#将查找表转换为产品ID和组
P_Lookup< - P_Lookup%>%
gather(ProductIDn,ProductID,ProductID1,ProductID2) %>%
select(ProductID,Group)%>%
distinct()%>%
nest(-ProductID,.key = Group)

#将组信息绑定到交易
#和下一个分析组
P_Trans< - P_Trans%>%
left_join(P_Lookup)%>%
过滤器(!map_lgl (Group,is.null))%>%
unnest(Group)%>%
group_by(TransactionID)

#迭代通过组生成结果
map(P_Groups,〜filter(P_Trans,Group ==。))%>%
map(〜mutate(。,No_of_PIDs = n_distinct(ProductType)))%>%
map_df(〜,(No,no_of_PIDs> 1))
#>资料来源:本地资料框[12 x 5]
#>组:TransactionID [4]
#>
#> TransactionID ProductID ProductType Group No_of_PIDs
#> < CHR> < CHR> < DBL> < CHR> < INT>
#> 1 a1 A 1 Group1 2
#> 2 a1 B 1 Group1 2
#> 3 a1 1 2 Group1 2
#> 4 a2 C 1 Group2 2
#> 5 a2 4 2 Group2 2
#> 6 a2 5 2 Group2 2
#> 7 a3 D 1 Group3 2
#> 8 a3 C 1 Group3 2
#> 9 a3 7 2 Group3 2
#> 10 a3 8 2 Group3 2
#> 11 a6 H 1 Group5 2
#> 12 a6 15 2 Group5 2


I have two data.frames--one look-up table that tells me a set products included in a group. Each group has at least one product of Type 1 and Type 2.

The second data.frame tells me details about the transaction. Each transaction can have one of the following products:

a) Only products of Type 1 from one of the groups

b) Only products of Type 2 from one of the groups

c) Product of Type 1 and Type 2 from the same group

For my analysis, I am interested in finding out c) above i.e. how many transactions have products of Type 1 and Type 2 (from the same group) sold. We will ignore the transaction altogether if Product of Type 1 and that of Type 2 from different groups that are sold in the same transaction.

Thus, each product of Type 1 or Type 2 MUST belong to the same group.

Here's my look up table:

> P_Lookup
   Group ProductID1 ProductID2
  Group1          A          1
  Group1          B          2
  Group1          B          3
  Group2          C          4
  Group2          C          5
  Group2          C          6
  Group3          D          7
  Group3          C          8
  Group3          C          9
  Group4          E         10
  Group4          F         11
  Group4          G         12
  Group5          H         13
  Group5          H         14
  Group5          H         15 

For instance, I won't have Product G and Product 15 in one transaction because they belong to different group.

Here are the transactions:

  TransactionID ProductID ProductType
             a1         A           1
             a1         B           1
             a1         1           2
             a2         C           1
             a2         4           2
             a2         5           2
             a3         D           1
             a3         C           1
             a3         7           2
             a3         8           2
             a4         H           1
             a5         1           2
             a5         2           2
             a5         3           2
             a5         3           2
             a5         1           2
             a6         H           1
             a6        15           2

My Code:

Now, I was able to write code using dplyr for shortlisting transactions from one group. However, I am not sure how I can vectorize my code for all groups.

Here's my code:

P_Groups<-unique(P_Lookup$Group)
Chosen_Group<-P_Groups[5]

P_Group_Ind <- P_Trans %>%
group_by(TransactionID)%>%
dplyr::filter((ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID1)) | 
(ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID2)) ) %>%
mutate(No_of_PIDs = n_distinct(ProductType)) %>%
mutate(Group_Name = Chosen_Group)

P_Group_Ind<-P_Group_Ind[P_Group_Ind$No_of_PIDs>1,]

This works well as long as I manually select each group i.e. by setting Chosen_Group. However, I am not sure how I can automate this. One way, I am thinking is to use for loop, but I know that the beauty of R is vectorization, so I want to stay away from using for loop.

I'd sincerely appreciate any help. I have spent almost two days on this. I looked at using dplyr in for loop in r, but it seems this thread is talking about a different issue.


DATA: Here's dput for P_Trans:

structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2", 
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5", 
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D", 
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1, 
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID", 
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")

Here's dput for P_Lookup:

structure(list(Group = c("Group1", "Group1", "Group1", "Group2", 
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4", 
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A", 
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H", 
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA, 
15L), class = "data.frame")


Here's the dput() after adding a product to P_Trans that doesn't exist in the look-up table:

structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2", 
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5", 
"a6", "a6", "a7"), ProductID = c("A", "B", "1", "C", "4", "5", 
"D", "C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15", 
"22"), ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 
2, 2, 2, 1, 2, 3)), .Names = c("TransactionID", "ProductID", 
"ProductType"), row.names = c(NA, 19L), class = "data.frame")

解决方案

Below is a tidyverse (dplyr, tidyr, and purrr) solution that I hope will help.

Note that the use of map_df in the last line returns all results as a data frame. If you'd prefer it to be a list object for each group, then simply use map.

library(dplyr)
library(tidyr)
library(purrr)

# Save unique groups for later use
P_Groups <- unique(P_Lookup$Group)

# Convert lookup table to product IDs and Groups
P_Lookup <- P_Lookup %>% 
              gather(ProductIDn, ProductID, ProductID1, ProductID2) %>% 
              select(ProductID, Group) %>% 
              distinct() %>% 
              nest(-ProductID, .key = Group)

# Bind Group information to transactions
# and group for next analysis
P_Trans <- P_Trans %>%
             left_join(P_Lookup) %>%
             filter(!map_lgl(Group, is.null)) %>%  
             unnest(Group) %>% 
             group_by(TransactionID)

# Iterate through Groups to produce results
map(P_Groups, ~ filter(P_Trans, Group == .)) %>% 
  map(~ mutate(., No_of_PIDs = n_distinct(ProductType))) %>% 
  map_df(~ filter(., No_of_PIDs > 1))
#> Source: local data frame [12 x 5]
#> Groups: TransactionID [4]
#> 
#>    TransactionID ProductID ProductType  Group No_of_PIDs
#>            <chr>     <chr>       <dbl>  <chr>      <int>
#> 1             a1         A           1 Group1          2
#> 2             a1         B           1 Group1          2
#> 3             a1         1           2 Group1          2
#> 4             a2         C           1 Group2          2
#> 5             a2         4           2 Group2          2
#> 6             a2         5           2 Group2          2
#> 7             a3         D           1 Group3          2
#> 8             a3         C           1 Group3          2
#> 9             a3         7           2 Group3          2
#> 10            a3         8           2 Group3          2
#> 11            a6         H           1 Group5          2
#> 12            a6        15           2 Group5          2

这篇关于使用dplyr将一个data.frame的输出管道输出到另一个的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆