Join then mutate using data.table without intermediate table

Views: 580 Published: 2017/3/12 13:07:04 Tags: r, data.table

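Question

I am a beginner with data.table and have been searching for how to do a join and then mutate columns. I found the thread "data.table join then add columns to existing data.frame without re-copy", but I was not able to get any further.

Please note that I am able to do what I want using dplyr, but running that code on the actual data is not feasible because of its size. For the same reason, I cannot create intermediate tables.

Here are my data and my solution using dplyr.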

Input

    DFI = structure(list(
      PO_ID = c("P1234", "P1234", "P1234", "P1234", "P1234", "P1234", "P2345", "P2345", "P3456", "P4567"),
      SO_ID = c("S1", "S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"),
      F_Year = c(2012, 2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015),
      Product_ID = c("385X", "385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U", "2700"),
      Revenue = c(1, 2, 3, 34, 34, 6, 7, 88, 9, 100),
      Quantity = c(1, 2, 3, 8, 8, 6, 7, 8, 9, 40),
      Location1 = c("MA", "NY", "WA", "NY", "WA", "NY", "IL", "IL", "MN", "CA")),
      .Names = c("PO_ID", "SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"),
      row.names = c(NA, 10L), class = "data.frame")

Look Up Table

    DF_Lookup = structure(list(
      PO_ID = c("P1234", "P1234", "P1234", "P2345", "P2345", "P3456", "P4567"),
      SO_ID = c("S1", "S2", "S2", "S3", "S4", "S7", "S10"),
      F_Year = c(2012, 2013, 2013, 2011, 2011, 2014, 2015),
      Product_ID = c("385X", "450X", "900X", "3700", "3700", "A11U", "2700"),
      Revenue = c(50, 70, 35, 100, -50, 50, 100),
      Quantity = c(3, 20, 20, 20, -10, 20, 40)),
      .Names = c("PO_ID", "SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity"),
      row.names = c(NA, 7L), class = "data.frame")

Output

    DFO = structure(list(
      PO_ID = c("P1234", "P1234", "P1234", "P1234", "P1234", "P1234", "P2345", "P2345", "P3456", "P4567"),
      SO_ID = c("S1", "S1", "S1", "S2", "S2", "S2", "S3", "S4", "S7", "S10"),
      F_Year = c(2012, 2012, 2012, 2013, 2013, 2013, 2011, 2011, 2014, 2015),
      Product_ID = c("385X", "385X", "385X", "450X", "450X", "900X", "3700", "3700", "A11U", "2700"),
      Revenue = c(16.6666666666667, 16.6666666666667, 16.6666666666667, 35, 35, 35, 100, -50, 50, 100),
      Quantity = c(1, 1, 1, 10, 10, 20, 20, -10, 20, 40),
      Location1 = c("MA", "NY", "WA", "NY", "WA", "NY", "IL", "IL", "MN", "CA")),
      .Names = c("PO_ID", "SO_ID", "F_Year", "Product_ID", "Revenue", "Quantity", "Location1"),
      row.names = c(NA, 10L), class = "data.frame")

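Here's my code using dplyr

I am using two libraries here: dplyr and compare.

I use a left join to add the new entries from the look-up table to DFI. Then I divide the Revenue and Quantity columns by the number of rows in each group. This is because I want to prevent the numbers from being inflated when they are later grouped and summed.
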
    library(dplyr)   # provides %>%

    DF_Generated <- DFI %>%
      dplyr::left_join(DF_Lookup, by = c("PO_ID", "SO_ID", "F_Year", "Product_ID")) %>%
      dplyr::group_by(PO_ID, SO_ID, F_Year, Product_ID) %>%
      dplyr::mutate(Count = n()) %>%   # number of DFI rows sharing the same key
      dplyr::ungroup() %>%
      dplyr::mutate(Revenue = Revenue.y / Count, Quantity = Quantity.y / Count) %>%
      dplyr::select(PO_ID:Product_ID, Location1, Revenue, Quantity)

Here's how the output matches:

    library(compare)
    compare(DF_Generated, DFO, allowAll = TRUE)
    TRUE

I'd sincerely appreciate any help.

Answer

It's more efficient to simply add columns to DFI (in an "update join"), rather than making a new table:

    library(data.table)
    # DFI and DF_Lookup are built as data.frames above; convert them by reference first
    setDT(DFI); setDT(DF_Lookup)

    DFI[DF_Lookup, on=.(PO_ID, SO_ID, F_Year, Product_ID),
      `:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N), by=.EACHI]

DFI then looks like this:

        PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1    newrev newqty
     1: P1234    S1   2012       385X       1        1        MA  16.66667      1
     2: P1234    S1   2012       385X       2        2        NY  16.66667      1
     3: P1234    S1   2012       385X       3        3        WA  16.66667      1
     4: P1234    S2   2013       450X      34        8        NY  35.00000     10
     5: P1234    S2   2013       450X      34        8        WA  35.00000     10
     6: P1234    S2   2013       900X       6        6        NY  35.00000     20
     7: P2345    S3   2011       3700       7        7        IL 100.00000     20
     8: P2345    S4   2011       3700      88        8        IL -50.00000    -10
     9: P3456    S7   2014       A11U       9        9        MN  50.00000     20
    10: P4567   S10   2015       2700     100       40        CA 100.00000     40

This is a pretty natural extension of the Q&A linked in the OP.
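
To make the term "update join" concrete, here is a minimal sketch on two toy tables. The table names orders and lookup, their columns, and their values are invented for illustration and are not the data from the question. It only assumes the data.table package is installed, and it uses address() to check that the table is modified in place rather than copied.

```r
library(data.table)

# Toy tables (hypothetical names and values, for illustration only)
orders <- data.table(id = c("a", "a", "b"), loc = c("MA", "NY", "IL"))
lookup <- data.table(id = c("a", "b"), rev = c(50, 100))

addr_before <- address(orders)

# Update join: for every row of `orders` that matches `lookup` on `id`,
# add the column `rev` by reference. The prefix i. picks the column
# from the table in the i position (here: lookup).
orders[lookup, on = .(id), rev := i.rev]

addr_after <- address(orders)

print(orders)
# Expected to be TRUE when the column was added in place, without a copy
print(identical(addr_before, addr_after))
```

No joined result is materialized as a separate object here, which is the property the question asks for. Note that this plain form copies the look-up value onto every matching row; splitting it across the matching rows needs the grouping used in the answer.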
by=.EACHI groups by each row of i in x[i, on=, j], and .N is the number of rows in that group. If you want the Revenue and Quantity columns overwritten instead of adding new ones, use `:=`(Revenue = i.Revenue/.N, Quantity = i.Quantity/.N).
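
The roles of by=.EACHI and .N can be seen in isolation with a small sketch. The tables big and lk and the columns grp, val, total and share are hypothetical names chosen for this example; only the data.table syntax itself comes from the answer.

```r
library(data.table)

# Toy tables (hypothetical names and values, for illustration only)
big <- data.table(grp = c("a", "a", "a", "b"), val = 1:4)
lk  <- data.table(grp = c("a", "b"), total = c(30, 8))

# One group per row of lk; .N is the number of rows of big matched by that row
counts <- big[lk, on = .(grp), .N, by = .EACHI]
print(counts)

# Spread each look-up total evenly over the rows it matches, in place
big[lk, on = .(grp), share := i.total / .N, by = .EACHI]
print(big)

# Summing the shares gives back the look-up totals, so nothing is inflated
print(sum(big$share))
print(sum(lk$total))
```

Because each total is divided by the number of rows it lands on, the column sums agree with the look-up table after the join, which is the reason the question gives for dividing by the group size.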
