Join then mutate using data.table without intermediate table


Problem description

I am a beginner in data.table and have been searching for how to do a join and then mutate columns. I found the thread "data.table join then add columns to existing data.frame without re-copy", but I was not able to proceed further.

Please note that I am able to do what I want using dplyr, but it's not feasible to run this code on the actual data because of its size. For the same reason, I cannot create intermediate tables.

Here are my data and solution using dplyr

Input

DFI = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012, 
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100), Quantity = c(1, 
2, 3, 8, 8, 6, 7, 8, 9, 40), Location1 = c("MA", "NY", "WA", 
"NY", "WA", "NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"
), row.names = c(NA, 10L), class = "data.frame")

Look Up Table

DF_Lookup = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P2345", 
"P2345", "P3456", "P4567"), SO_ID = c("S1", "S2", "S2", "S3", 
"S4", "S7", "S10"), F_Year = c(2012, 2013, 2013, 2011, 2011, 
2014, 2015), Product_ID = c("385X", "450X", "900X", "3700", "3700", 
"A11U", "2700"), Revenue = c(50, 70, 35, 100, -50, 50, 100), 
    Quantity = c(3, 20, 20, 20, -10, 20, 40)), .Names = c("PO_ID", 
"SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"), row.names = c(NA, 
7L), class = "data.frame")

Output

DFO = structure(list(PO_ID = c("P1234", "P1234", "P1234", "P1234", 
"P1234", "P1234", "P2345", "P2345", "P3456", "P4567"), SO_ID = c("S1", 
"S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"), F_Year = c(2012, 
2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015), Product_ID = c("385X", 
"385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U", 
"2700"), Revenue = c(16.6666666666667, 16.6666666666667, 16.6666666666667, 
35, 35, 35, 100, -50, 50, 100), Quantity = c(1, 1, 1, 10, 10, 
20, 20, -10, 20, 40), Location1 = c("MA", "NY", "WA", "NY", "WA", 
"NY", "IL", "IL", "MN", "CA")), .Names = c("PO_ID", "SO_ID", 
"F_Year", "Product_ID", "Revenue", "Quantity", "Location1"), row.names = c(NA, 
10L), class = "data.frame")

Here's my code using dplyr

I am using two libraries here: dplyr and compare

I am using a left join to add the new entries from the Look Up table to DFI. Then I divide the revenue and quantity columns by the number of rows in each group. This is because I want to prevent inflation of the numbers when the data are later grouped and summed.

DF_Generated <- DFI %>% 
  dplyr::left_join(DF_Lookup,by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
  dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
  dplyr::mutate(Count = n()) %>%
  dplyr::ungroup()%>%
  dplyr::mutate(Revenue = Revenue.y/Count, Quantity = Quantity.y/Count) %>%
  dplyr::select(PO_ID:Product_ID,Location1,Revenue,Quantity)

Here's how the output matches:

compare(DF_Generated,DFO,allowAll = TRUE)
TRUE

I'd sincerely appreciate any help.

Solution

It's more efficient to simply add columns to DFI (in an "update join"), rather than making a new table:

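library(data.table)
setDT(DFI); setDT(DF_Lookup)  # convert both data.frames to data.tables by reference

DFI[DF_Lookup, on=.(PO_ID, SO_ID, F_Year, Product_ID), 
  `:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N)
, by=.EACHI]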

    PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1    newrev newqty
 1: P1234    S1   2012       385X       1        1        MA  16.66667      1
 2: P1234    S1   2012       385X       2        2        NY  16.66667      1
 3: P1234    S1   2012       385X       3        3        WA  16.66667      1
 4: P1234    S2   2013       450X      34        8        NY  35.00000     10
 5: P1234    S2   2013       450X      34        8        WA  35.00000     10
 6: P1234    S2   2013       900X       6        6        NY  35.00000     20
 7: P2345    S3   2011       3700       7        7        IL 100.00000     20
 8: P2345    S4   2011       3700      88        8        IL -50.00000    -10
 9: P3456    S7   2014       A11U       9        9        MN  50.00000     20
10: P4567   S10   2015       2700     100       40        CA 100.00000     40

This is a pretty natural extension of the Q&A linked in the OP.

by=.EACHI groups by each row of i in x[i, on=, j], and .N is the number of rows of x matched by that row of i (here, the number of DFI rows that share the lookup row's key).
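
To see what by=.EACHI and .N are doing, here is a minimal sketch on toy data. The tables x and lookup and their columns are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Toy tables (hypothetical): x has three rows for key "a" and one for "b"
x <- data.table(id = c("a", "a", "a", "b"), val = 1:4)
lookup <- data.table(id = c("a", "b"), total = c(30, 8))

# Without `:=`, j is evaluated once per row of lookup; .N is the number
# of rows of x matched by that lookup row
counts <- x[lookup, on = .(id), .(n_matched = .N), by = .EACHI]
print(counts)

# With `:=`, the per-group value is recycled over the matched rows of x
x[lookup, on = .(id), share := i.total / .N, by = .EACHI]
print(x)
```

The first call should report 3 matched rows for "a" and 1 for "b", so the second call spreads 30 over three rows and 8 over one.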

If you want the rev and qty cols overwritten, use `:=`(Revenue = i.Revenue/.N, Quantity = i.Quantity/.N).
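
A rough sketch of the overwriting variant on a small made-up table (dfi and lookup below are hypothetical stand-ins for DFI and DF_Lookup, keyed on one column only to keep it short). One thing to be aware of: an update join only touches rows of x that have a match, so unmatched rows keep their old values, whereas the dplyr left join would give them NA.

```r
library(data.table)

# Toy data (hypothetical), shaped like DFI and DF_Lookup but much smaller
dfi <- data.table(PO_ID = c("P1", "P1", "P2", "P9"),
                  Revenue = c(1, 2, 7, 5),
                  Quantity = c(1, 2, 7, 5))
lookup <- data.table(PO_ID = c("P1", "P2"),
                     Revenue = c(50, 100),
                     Quantity = c(4, 20))

# Overwrite Revenue and Quantity in place; the i. prefix picks the lookup columns
dfi[lookup, on = .(PO_ID),
    `:=`(Revenue = i.Revenue / .N, Quantity = i.Quantity / .N),
    by = .EACHI]

print(dfi)  # "P9" has no match in lookup, so its row keeps the old values
```

If the unmatched rows should end up as NA instead, that would need a separate step; it is not something the update join does on its own.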
