"Recursive" self join in data.table
Question
I have a component list with three columns: product, component, and quantity of the component used:
a <- structure(list(prodName = c("prod1", "prod1", "prod2", "prod3",
"prod3", "int1", "int1", "int2", "int2"), component = c("a",
"int1", "b", "b", "int2", "a", "b", "int1", "d"), qty = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)), row.names = c(NA, -9L), class = c("data.table",
"data.frame"))
prodName component qty
1 prod1 a 1
2 prod1 int1 2
3 prod2 b 3
4 prod3 b 4
5 prod3 int2 5
6 int1 a 6
7 int1 b 7
8 int2 int1 8
9 int2 d 9
Products with names starting with "prod" are final products, those with names like "int" are intermediate products, and those named with single letters are raw materials.
I need the full component list of the final products with only raw materials as components. That is, I want to expand every "int" product into the raw materials it is made of.
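- An intermediate product can be composed of raw materials and other intermediate products, which is why I call this "recursive".
- I cannot know the nesting/recursion depth of the intermediate products in advance (2 levels in this example, more than 6 levels in the real data).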
For this example, my expected result is as follows (I have written out how each resulting quantity is computed):
prodName |component |qty
prod1 |a |1+2*6 = 13
prod1 |b |0+2*7 = 14
prod2 |b |3
prod3 |b |4+5*8*7 = 284
prod3 |a |0+5*8*6 = 240
prod3 |d |0+5*9 = 45
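The arithmetic in the table above can be checked with a small recursive sketch in base R. This is only an illustration of the expected numbers, not the approach used later in the thread: the helper name `explode` and the plain data.frame `bom` are hypothetical, and the sketch assumes the component list contains no cycles.

```r
# Toy component list from the question, as a plain data.frame
bom <- data.frame(
  prodName  = c("prod1", "prod1", "prod2", "prod3", "prod3",
                "int1", "int1", "int2", "int2"),
  component = c("a", "int1", "b", "b", "int2", "a", "b", "int1", "d"),
  qty       = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: raw-material quantities needed for one unit of `item`.
# Returns a named numeric vector; assumes there are no cycles in `bom`.
explode <- function(item, bom) {
  rows <- bom[bom$prodName == item, ]
  if (nrow(rows) == 0L) {
    # `item` never appears as a product, so treat it as a raw material
    return(setNames(1, item))
  }
  total <- numeric(0)
  for (i in seq_len(nrow(rows))) {
    # Quantity of this component times whatever it expands into
    part <- rows$qty[i] * explode(rows$component[i], bom)
    for (nm in names(part)) {
      old <- if (nm %in% names(total)) total[[nm]] else 0
      total[nm] <- old + part[[nm]]
    }
  }
  total
}

prod1 <- explode("prod1", bom)
prod3 <- explode("prod3", bom)
```

Here `prod3[["b"]]` accumulates the direct 4 units plus 5*8*7 units via int2 and int1, which is the 284 shown in the table. Recursing row by row like this is fine for a toy check but is unlikely to scale to a large table.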
What I have done:
I solved this with a very cumbersome sequence of joins using merge. While this approach worked for the toy data, it is unlikely that I can apply it to the real data.
# load data.table
library(data.table)
# split the tables between products and different levels of intermediate
a1 <- a[prodName %like% "prod", ]
b1 <- a[prodName %like% "int1", ]
c1 <- a[prodName %like% "int2", ]
# convert int2 to raw materials
d1 <- merge(c1,
            b1,
            by.x = "component",
            by.y = "prodName",
            all.x = TRUE)[
  is.na(component.y),
  component.y := component][
  is.na(qty.y),
  qty.y := 1][,
  .(prodName, qty = qty.x * qty.y),
  by = .(component = component.y)]
# Since int1 is already exploded into raw materials, rbind both tables:
d1 <- rbind(d1, b1)
# convert all final products into raw materials, except that the raw mats that go directly into the product won't appear:
e1 <- merge(a1,
            d1,
            by.x = "component",
            by.y = "prodName",
            all.x = TRUE)
# rbind the last calculated raw mats (those coming from intermediate products) with those coming _directly_ into the final product:
result <- rbind(e1[!is.na(qty.y),
                   .(prodName, qty = qty.x * qty.y),
                   by = .(component = component.y)],
                e1[is.na(qty.y),
                   .(prodName, component, qty = qty.x)])[,
  .(qty = sum(qty)),
  keyby = .(prodName, component)]
I'm aware that I can split the data into tables and keep joining until every intermediate product is expressed in terms of raw materials only, but as mentioned above, that would be a last resort because of the size of the data and the number of recursion levels among the intermediate products.
Is there an easier or better way to do this sort of recursive join?
Recommended answer
Here's my attempt using your dataset.
It uses a while loop that checks whether any values in the component field also appear in the prodName field, and keeps joining until none do. The table must have the same fields on every pass of the loop, so instead of adding a new column for each recursive multiplier (i.e., 5*8*7 at the end), the multipliers are folded into a single column as the loop iterates. That is, 5*8*7 becomes 5*56 at the end.
library(data.table)
# add a multiplier column (modifies a by reference) and start from a copy
a[, qty_multiplier := 1]
b <- copy(a)
# keep joining while some component is itself listed as a product
while (b[component %in% prodName, .N] > 0) {
  b <- b[a
         , on = .(prodName = component)
         , .(prodName = i.prodName
             , component = ifelse(is.na(x.component), i.component, x.component)
             , qty = i.qty
             , qty_multiplier = ifelse(is.na(x.qty), 1, x.qty * qty_multiplier)
         )
  ]
}
# keep the final products and sum the quantity per raw material
b[prodName %like% 'prod', .(qty = sum(qty * qty_multiplier)), by = .(prodName, component)]
prodName component qty
1: prod1 a 13
2: prod1 b 14
3: prod2 b 3
4: prod3 b 284
5: prod3 a 240
6: prod3 d 45
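One caveat with the loop in the answer is that it would never terminate if the component list contained a cycle (for example, int1 listing int2 and int2 listing int1), and it adds qty_multiplier to a by reference. A minimal sketch of a guarded variant follows, assuming the same column names as the question. The function name `explode_bom` and the `max_iter` cap are hypothetical additions for illustration and are not part of the original answer.

```r
library(data.table)

# Hypothetical wrapper around the same join loop; `max_iter` is an assumed safety cap
explode_bom <- function(bom, max_iter = 50L) {
  a <- copy(as.data.table(bom))   # work on a copy so the caller's table is untouched
  a[, qty_multiplier := 1]
  b <- copy(a)
  iter <- 0L
  while (b[component %in% prodName, .N] > 0) {
    iter <- iter + 1L
    if (iter > max_iter) {
      stop("max_iter reached: possible cycle in the component list")
    }
    # Same join as in the answer: expand each component one more level
    b <- b[a, on = .(prodName = component),
           .(prodName = i.prodName,
             component = ifelse(is.na(x.component), i.component, x.component),
             qty = i.qty,
             qty_multiplier = ifelse(is.na(x.qty), 1, x.qty * qty_multiplier))]
  }
  b[, .(qty = sum(qty * qty_multiplier)), by = .(prodName, component)]
}

# Toy data from the question
bom <- data.table(
  prodName  = c("prod1", "prod1", "prod2", "prod3", "prod3",
                "int1", "int1", "int2", "int2"),
  component = c("a", "int1", "b", "b", "int2", "a", "b", "int1", "d"),
  qty       = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L)
)
res <- explode_bom(bom)
res[prodName %like% "prod"]
```

The function returns rows for the intermediate products as well, so the caller filters on prodName afterwards, as the answer does. The default of 50 passes is arbitrary; any value comfortably above the expected nesting depth should do.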