R, dplyr: Function that quickly builds list of complementary rows based on conditions


Question

I have a data set of ~80,000 rows by 26 columns. The rows correspond to "SKUs", or unique IDs for robot building sets. The columns correspond to 26 different robot parts. A cell contains a part's contribution towards building a whole robot. A row's proportions may not sum to 1.0, since a building set won't always have 100% of the parts needed to build a whole robot.

The main goal is to build a function that accepts a SKU as input and outputs a list of complementary SKUs. A complementary row is defined as:


  1. If a column's value is non-zero for the given row, the complement's value in that column must be zero
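As a minimal sketch of this rule, assuming each SKU is represented by a numeric vector of part proportions: the helper name `is_complement` is hypothetical and not from the original post, and the vectors are copied from the toy example in this post.

```r
# Hypothetical helper: TRUE when no part is non-zero in both rows.
is_complement <- function(x, y) {
  !any(x != 0 & y != 0)
}

# Part proportions (p1..p6) for SKUs 5, 8 and 6 from the toy data
sku5 <- c(0, 0, 0, 0, 0.1739382, 0)
sku8 <- c(0.1565866, 0, 0, 0.1578712, 0, 0)
sku6 <- c(0, 0, 0, 0, 0.1654747, 0.1336933)

is_complement(sku5, sku8)  # TRUE: the two SKUs share no parts
is_complement(sku5, sku6)  # FALSE: both contain p5
```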

The goal is to find all possible sets of SKUs that complement a given SKU such that a whole robot can be built. Additionally, it is important to see the weighted revenue per robot ("weightedPrice") for this "Frankenstein" set of SKUs. It is also nice to show how the weightedPrice changes with the addition of each complementary SKU.

A minimal working toy example (MWE):

set.seed(1)
a = runif(n=60, min=0, max=0.2)
a[a<0.12] = 0
n = 10
A = as.data.frame(matrix(a,              
                         nrow=n,         
                         ncol=6,         
                         byrow = TRUE))
A$rowTally <- rowSums(A != 0)
A$sku <- seq(from = 1, to = n)
A$totalDollarSales <- runif(n=n, min=1*10^2, max=1*10^6)
A$totalUnitSales <- runif(n=n, min=1*10^2, max=1*10^6)
names(A) <- c("p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "p6_prop", "rowTally", "sku", "totalDollarSales", "totalUnitSales")
A <- A[c("sku", "p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "p6_prop", "rowTally", "totalDollarSales", "totalUnitSales")]
A$dollarsPerRobot <- A$totalDollarSales/A$totalUnitSales
A

   sku   p1_prop   p2_prop   p3_prop   p4_prop   p5_prop   p6_prop rowTally
1    1 0         0         0         0.1816416 0         0.1796779        2
2    2 0.1889351 0.1321596 0.1258228 0         0         0                3
3    3 0.1374046 0         0.1539683 0         0.1435237 0.1983812        4
4    4 0         0.1554890 0.1869410 0         0.1303348 0                3
5    5 0         0         0         0         0.1739382 0                1
6    6 0         0         0         0         0.1654747 0.1336933        2
7    7 0.1588480 0         0.1447422 0         0.1641893 0.1294120        4
8    8 0.1565866 0         0         0.1578712 0         0                2
9    9 0.1464627 0.1385463 0         0.1722419 0         0                3
10  10 0         0         0         0         0.1324010 0                1
   totalDollarSales totalUnitSales dollarsPerRobot
1         912884.64       339139.0       2.6917711
2         293674.01       839456.4       0.3498383
3         459119.82       346748.8       1.3240703
4         332461.43       333841.6       0.9958659
5         650905.38       476403.6       1.3662898
6         258090.98       892209.1       0.2892718
7         478597.39       864353.0       0.5537059
8         766334.04       390050.5       1.9647044
9          84338.49       777343.0       0.1084959
10        875333.80       960621.9       0.9112157

I'm trying to write a function:

def frankensteinRobot(df, sku, skuRowTally):
    1. find another SKU in dataframe, df.
       - must have non-overlapping parts with existing SKU set
       - rowTally <= skuRowTally (want to find small SKUs to add)
       - must be relatively same number of totalUnitSales
    2. append new SKU to list, and take minimum of totalUnitSales. 
    3. Calculate the weighted, per robot price
       dollarsPerRobotSKU_1*(1/length(SKU_list))+...+dollarsPerRobotSKU_n*(1/length(SKU_list)) 
       and append to the end of a list so we can track profitability with each additional SKU.
    4. repeat steps 1, 2 & 3.
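Steps 2 and 3 can be sketched in a few lines of base R. This is only an illustration under the assumption that every SKU in the set gets equal weight, as in the formula above. The variable names are hypothetical, and the numbers are the toy values for SKU 5 followed by SKU 2.

```r
# dollarsPerRobot and totalUnitSales in the order the SKUs are added (SKU 5, then SKU 2)
dollars_per_robot <- c(1.3662898, 0.3498383)
total_unit_sales  <- c(476403.6, 839456.4)

# Step 3: equal-weight average after each addition (cumulative sum / count so far)
weighted_price <- cumsum(dollars_per_robot) / seq_along(dollars_per_robot)

# Step 2: running minimum of totalUnitSales across the set
min_unit_sales <- cummin(total_unit_sales)

weighted_price
min_unit_sales
```

Each vector has one entry per SKU added, so the change in price with every extra SKU can be read off directly.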

I've only been able to figure out how to find the next complementary SKU, but not the full set of SKUs:

A_candidates <- sapply(data.frame(outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))), which)
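The line above relies on a `check_compliment` function that is not shown in the post. As one possible alternative for the pairwise check (not the asker's code), the sketch below uses a matrix cross-product on a small made-up part matrix. Note that a full n-by-n matrix would not fit in memory for ~80,000 rows, so this would have to be applied to chunks of rows.

```r
# Toy part matrix: rows are SKUs, columns are parts (values are assumptions)
parts <- matrix(c(0,   0.2, 0,
                  0.3, 0,   0,
                  0,   0.1, 0.4),
                nrow = 3, byrow = TRUE)

present    <- (parts != 0) * 1       # 1 where a SKU contains the part
overlap    <- tcrossprod(present)    # [i, j] = number of parts shared by SKUs i and j
complement <- overlap == 0           # TRUE where SKUs i and j share no parts
diag(complement) <- FALSE            # a SKU is never its own complement

# For each SKU, the row indices of its complementary SKUs
candidates <- lapply(seq_len(nrow(parts)), function(i) which(complement[i, ]))
candidates
```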

Example input:

frankensteinRobot(df = A, sku = 5, skuRowTally = 3)

Example output (note that because my MWE only has 10 rows, the example output lists have only 2 elements, but in actuality they will be longer. Also, I'm unsure what data structure is appropriate. Maybe a dataframe where one column is a list?):

[list of SKUs]; [propSum]; [maxLb]; [list of weightedPrice]

[5, 2]; [propSum=0.6208557]; [maxLb=476403.6]; [0.8580641]
[5, 8]; [propSum=0.488396]; [maxLb=390050.5]; [1.665497]
[5, 9]; [propSum=0.6311891]; [maxLb=476403.6]; [0.7373929]

Example input:

frankensteinRobot(df = A, sku = 6, skuRowTally = 2)

Example output:

[6, 8]; [propSum=0.6136258]; [maxLb=390050.5]; [1.126988]


Recommended answer

Final Edit: This solution relies on data.table and could use a loop at the end. Still, you could keep on copying and pasting to make it work. The solution relies mainly on this blazingly quick solution:

  search_dt <- dt[sku1 == searchSKU]
  current_parts <- names(search_dt[, .SD, .SDcols = part_names])[which(search_dt[, .SD, .SDcols = part_names]>0)]

  steal_dt <- dt[rowTally <= searchRowTally]

#returns SKUs which are 0 for the parts we already have
  steal_dt <- steal_dt[steal_dt[, j = rowSums(.SD) == 0, .SDcols = current_parts]]

  franken_rob <- cbind(search_dt, steal_dt)

I'm still uncertain on some of the criteria. I assume that as FrankenBot is assembled, each successive SKU can't be in the previous parts. In other words, SKU3 doesn't have any parts in common with SKU1 and SKU2.
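That assumption can be sketched as a check against the union of parts already covered by the set. The helper name `fits_set` is hypothetical, and the proportions are rounded toy values.

```r
# Hypothetical helper: can `candidate` join a set whose rows are in `chosen`?
# `chosen` is a matrix with one row per SKU already in the set.
fits_set <- function(chosen, candidate) {
  used <- colSums(chosen != 0) > 0   # parts covered by any SKU so far
  !any(used & candidate != 0)        # candidate must be zero on all of them
}

chosen <- rbind(sku1 = c(0,    0,    0,    0.18, 0, 0.18),
                sku2 = c(0.19, 0.13, 0.13, 0,    0, 0))

fits_set(chosen, c(0, 0, 0, 0, 0.17, 0))     # TRUE: only p5, which is still unused
fits_set(chosen, c(0.16, 0, 0, 0.16, 0, 0))  # FALSE: p1 and p4 are already covered
```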

Solution output [I edited it one more time...]{and yet another...}:

# A tibble: 15 x 8
    sku1  sku2  sku3 propSums Parts Robots dollarsPerRobot totalUnitSales
   <int> <int> <int>    <dbl> <dbl>  <dbl>           <dbl>          <dbl>
 1     1     2     5    0.982     6      3           1.47         339139.
 2     1     2    10    0.941     6      3           1.32         339139.
 3     1     4    NA    0.834     5      2           1.84         333842.
 4     1     5    NA    0.535     3      2           2.03         339139.
 5     1    10    NA    0.494     3      2           1.80         339139.
 6     2     5    NA    0.621     4      2           0.858        476404.
 7     2     6    NA    0.746     5      2           0.320        839456.
 8     2    10    NA    0.579     4      2           0.631        839456.
 9     4     8    NA    0.787     5      2           1.48         333842.
10     5     8    NA    0.488     3      2           1.67         390051.
11     5     9    NA    0.631     4      2           0.737        476404.
12     6     8    NA    0.614     4      2           1.13         390051.
13     6     9    NA    0.756     5      2           0.199        777343.
14     8    10    NA    0.447     3      2           1.44         390051.
15     9    10    NA    0.590     4      2           0.510        777343.

Solution code:

library(data.table)
# generate data -----------------------------------------------------------

set.seed(1)
n = 10
cols = 6 #added
part_names =  paste0('p', c(1:cols), '_prop')

a = runif(n* cols, min=0, max=0.2)
a[a<0.12] = 0

A <- data.table(matrix(a, nrow=n, ncol=cols,byrow = TRUE))
A[, `:=`(rowTally1 = rowSums(.SD != 0),
         sku1 = .I
         ,totalDollarSales1 = runif(n=n, min=1*10^2, max=1*10^6)
         ,totalUnitSales1 =  runif(n=n, min=1*10^2, max=1*10^6))]

A[, dollarsPerRobot1:=totalDollarSales1/totalUnitSales1]

setnames(A, c(paste0('V',1:cols)), part_names)
setcolorder(A, 'sku1')

non_part_names<- setdiff(names(A), c('sku1',part_names))
non_part_names<- stringr::str_sub(non_part_names, 1, -2)

search_fun <- function (search_dt, steal_dt, searchSKU, b_loop = FALSE, sale_range = NULL) {

  sku_count<- length(grep('sku', names(search_dt)))
  skus <- paste0('sku', 1:(sku_count+1))

  non_parts<- paste0(non_part_names, rep(1:(sku_count+1), each = length(non_part_names)))

  blank_table <- setnames(data.table(matrix(nrow = 0, ncol = length(search_dt) + 1 + length(non_part_names))),c(skus,part_names, non_parts))

  if (length(searchSKU) != sku_count) {
    stop('not enough SKUs to go around')
  } 

  for (i in 1:length(searchSKU)) {
    search_dt <- search_dt[get(paste0('sku', i)) == searchSKU[i]]
  }
  current_parts <- names(search_dt[, .SD, .SDcols = part_names])[which(search_dt[, .SD, .SDcols = part_names]>0)]
  search_dt[, (setdiff(part_names, current_parts)) := NULL, ]

  # Could be made faster if the sku1s were filtered down to whichever ones are in sku.N
  # Right now it still looks through skus that may have already been filtered out.

  if (!is.null(sale_range)) {
    if (length(sale_range) != 2) {
      warning('Sale range needs to be length two with sale_range[1] = lower range and sale_range[2] = upper range')
    } else {
    steal_dt <- steal_dt[between(totalUnitSales1, sale_range[1] * search_dt$totalUnitSales1, search_dt$totalUnitSales1 * sale_range[2])]
    }
  }


  if (b_loop) {
    steal_dt <- steal_dt[sku1 > searchSKU[sku_count]]
  }

  steal_dt <- steal_dt[steal_dt[, j = rowSums(.SD) == 0, .SDcols = current_parts]]
  if (nrow(steal_dt) == 0) {
    return(blank_table)
  }

  steal_dt[, (current_parts) := NULL]
  setnames(steal_dt,
           c('sku1', paste0(non_part_names, '1')) ,
           c(paste0('sku',sku_count+1),
             paste0(non_part_names, sku_count+1))
  )

  franken_rob <- cbind(search_dt, steal_dt)
  setcolorder(franken_rob, c(skus, part_names))
  return(franken_rob)

}

searchRowTally <- 3
dt_search <- A

#this is done outside the function because there can be a lot of looping otherwise
dt_steal <- dt_search[rowTally1 <= searchRowTally]

#Near-instant with 80,000 rows and 26 columns
search_fun(dt_search, dt_steal, dt_search$sku1[5])
search_fun(dt_search, dt_steal, dt_search$sku1[5], b_loop = TRUE)
search_fun(dt_search, dt_steal, dt_search$sku1[5], sale_range = c(0.8, 1.2))
search_fun(dt_search, dt_steal, dt_search$sku1[5], b_loop = TRUE, sale_range = c(0.8, 1.2))

#Not doable with 80,000 rows, but still nice
rbindlist(lapply(1:(n-1), function (i) search_fun(dt_search, dt_steal, dt_search$sku1[i], b_loop = TRUE)))
rbindlist(lapply(1:(n-1), function (i) search_fun(dt_search, dt_steal, dt_search$sku1[i], b_loop = TRUE, sale_range = c(0.8, 1.2))))

#much more likely that the first regression would be a single value
# frank_1 <- search_fun(dt_search, dt_steal, dt_search$sku1[5], FALSE)
frank_1 <-  rbindlist(lapply(1:(n-1), function (i) search_fun(dt_search, dt_steal, dt_search$sku1[i],  TRUE)))

#This takes every n-1 of each sku1 group. 
frank_2 <- frank_1[frank_1[, head(.I, -1), by = sku1]$V1]
# frank_2 <- frank_1[, j = if(.N!=1) .SD, by = sku1]
dt_steal2 <- dt_steal[sku1 %in% base::unique(frank_1$sku2)]

frank_2 = rbindlist(lapply(1:nrow(frank_2), function (i) search_fun(frank_2, dt_steal2, melt(frank_2[i, .SD, .SDcols = grep('sku', names(frank_2))])[[2]],  TRUE)))

frank_3 <- frank_2[frank_2[, head(.I, -1), by = sku2]$V1]
dt_steal3 <- dt_steal2[sku1 %in% base::unique(frank_2$sku3)]

frank_3 = rbindlist(lapply(1:nrow(frank_3), function (i) search_fun(frank_3, dt_steal3, melt(frank_3[i, .SD, .SDcols = grep('sku', names(frank_3))])[[2]],  TRUE)))


# start combining our lists

franken_rob <- frank_1[!frank_2, on = c('sku1', 'sku2')]
franken_rob[, j= sku3:= integer()]
setcolorder(franken_rob, c('sku1','sku2','sku3'))

franken_rob <- rbind(frank_2, franken_rob, fill = TRUE)
# repeat the above for each additional frank_n

franken_rob[, `:=`(propSums=rowSums(.SD),
                   Parts = rowSums(.SD > 0))
            , .SDcols = part_names]

franken_rob[, Robots:= rowSums(.SD > 0, na.rm = TRUE), .SDcols = grep('sku', names(franken_rob))]
franken_rob[, dollarsPerRobot := rowSums(.SD, na.rm = TRUE) / Robots, .SDcols = grep ('dollarsPerRobot', names(franken_rob))]
franken_rob[, totalUnitSales := do.call(pmin,  c(.SD, list(na.rm = TRUE))), .SDcols = grep('totalUnitSales', names(franken_rob))]

franken_rob[, (part_names) := NULL]
franken_rob

tibble::as_tibble(franken_rob[, c(1:3, 16, 17, 18, 19,20)])

Edit: I don't have enough rep to comment - when trying the data.table solution with 80,000 rows and 26 columns, it tries to allocate a 2.3 GB vector when rowTally <= 13. However, when I change that to 3, it makes 1.1 million rows and filters down to 0.3 million rows. This is super cartesian.

Original: Here is a dplyr solution that seems to work with 80,000 rows and 26 columns. The trick was to figure out which columns had a non-zero result for the subset sku. With those columns, I went back to the original df and filtered.

There's also a line commented out for the unitSales being in some range.
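For illustration, the kind of range check that the commented-out line performs can be sketched in base R. The helper name `within_sales_range` is hypothetical. The 0.7 and 1.3 bounds match the commented-out line, and the sales figures are the toy values for SKU 5 (the reference) and SKUs 2, 8 and 9.

```r
# Hypothetical helper: is x within [lower * ref, upper * ref]? (bounds inclusive)
within_sales_range <- function(x, ref, lower = 0.7, upper = 1.3) {
  x >= lower * ref & x <= upper * ref
}

ref_sales  <- 476403.6                                   # totalUnitSales of SKU 5
candidates <- c(sku2 = 839456.4, sku8 = 390050.5, sku9 = 777343.0)

keep <- within_sales_range(candidates, ref_sales)
names(candidates)[keep]
```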


set.seed(1)
n = 10
cols = 6 #added

part_names =  paste0('p', c(1:cols), '_prop') #added
a = runif(n * cols, min=0, max=0.2) #changed from n to n * cols
a[a<0.12] = 0

A = as.data.frame(matrix(a,              
                         nrow=n,         
                         ncol=cols,  #changed to cols      
                         byrow = TRUE))
A$rowTally <- rowSums(A != 0)
A$sku <- seq(from = 1, to = n)
A$totalDollarSales <- runif(n=n, min=1*10^2, max=1*10^6)
A$totalUnitSales <- runif(n=n, min=1*10^2, max=1*10^6)
names(A) <- c(part_names, "rowTally", "sku", "totalDollarSales", "totalUnitSales")
A <- A[c("sku", part_names, "rowTally", "totalDollarSales", "totalUnitSales")]
A$dollarsPerRobot <- A$totalDollarSales/A$totalUnitSales


library(dplyr)

df <- as_tibble(A)%>%
  mutate(propSum = rowSums(.[, part_names]))

search_sku <- 5
skuRowTally <- 3

search_df <- df%>%
  filter(sku == search_sku)

current_parts <- search_df%>%
  select(all_of(part_names))%>%
  select_if(~sum(.)> 0)%>%
  names()

non_current_parts <- search_df%>%
  select(all_of(part_names))%>%
  select_if(~sum(.) == 0)%>%
  names()

df%>%
  filter(rowTally <= skuRowTally,
         sku != search_sku
         # , between(totalUnitSales, 0.7 * search_df$totalUnitSales, 1.3 * search_df$totalUnitSales)
         )%>%
  filter_at(vars(current_parts), all_vars(. == 0))%>%
  filter_at(vars(non_current_parts), any_vars(. != 0))%>%
  rowwise()%>%
  transmute(sku_search = search_sku,
            sku = sku,
            propSum = propSum + search_df$propSum,
            minLB = min(totalUnitSales, search_df$totalUnitSales),
            weightedPrice = (dollarsPerRobot + search_df$dollarsPerRobot) / 2,
            total_parts = rowTally + search_df$rowTally,
            complete_robot = if_else(total_parts == cols, 'COMPLETE', 'incomplete')
  )%>%
  ungroup()


frankensteinRobot <- function (df, sku1, skuTally) {
  # df <- as_tibble(df)%>%
  #   mutate(propSum = rowSums(.[, part_names]))


  # part_names and cols would also need to be passed in to make this
  # function completely standalone.

  search_sku <- sku1
  skuRowTally <- skuTally

  search_df <- df%>%
    filter(sku == search_sku)

  current_parts <- search_df%>%
    select(all_of(part_names))%>%
    select_if(~sum(.)> 0)%>%
    names()

  non_current_parts <- search_df%>%
    select(all_of(part_names))%>%
    select_if(~sum(.) == 0)%>%
    names()

  df%>%
    filter(rowTally <= skuRowTally,
           sku > search_sku
           # , between(totalUnitSales, 0.7 * search_df$totalUnitSales, 1.3 * search_df$totalUnitSales)
    )%>%
    filter_at(vars(current_parts), all_vars(. == 0))%>%
    filter_at(vars(non_current_parts), any_vars(. != 0))%>%
    rowwise()%>%
    transmute(sku_search = search_sku,
              sku = sku,
              propSum = propSum + search_df$propSum,
              minLB = min(totalUnitSales, search_df$totalUnitSales),
              weightedPrice = (dollarsPerRobot + search_df$dollarsPerRobot) / 2,
              total_parts = rowTally + search_df$rowTally,
              complete_robot = if_else(total_parts == cols, 'COMPLETE', 'incomplete')
    )%>%
    ungroup()
}

A<- as_tibble(A)%>%
  mutate(propSum = rowSums(.[, part_names]))

#I tried running 1:n with 80,000 rows. It wasn't pretty

bind_rows(lapply(1:n, function(x) frankensteinRobot(A, x, 3)))
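The column-filtering trick used above can also be written with `if_all()` and `all_of()`. This is a sketch on made-up data, assuming dplyr 1.0.4 or later, and it is not the answerer's original code.

```r
library(dplyr)

# Toy data (values are assumptions for illustration)
df <- tibble(
  sku     = 1:4,
  p1_prop = c(0.2, 0,   0,   0.1),
  p2_prop = c(0,   0.3, 0,   0),
  p3_prop = c(0,   0,   0.4, 0.2)
)
part_names <- c("p1_prop", "p2_prop", "p3_prop")
search_sku <- 1

search_row <- filter(df, sku == search_sku)

# Columns where the searched SKU has a non-zero value
current_parts <- part_names[unlist(search_row[1, part_names]) != 0]

# Keep the other SKUs that are zero in every one of those columns
complements <- df %>%
  filter(sku != search_sku,
         if_all(all_of(current_parts), ~ .x == 0))

complements$sku
```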

edit: here's an attempt at a data.table solution. It has some similarities, but instead of looping it does everything in one pass. If I could figure out how to express your main condition of no matching parts, it probably wouldn't be too shabby. Right now the bottlenecks are memory and the following step, since I can't get intersect to work on my list of lists:

results[
apply(results[, .(current_parts, rbt_missing_curr_parts)], 1, function(x) length(intersect(x[[1]], x[[2]]))==0)
]
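One possible way to compare two list columns row by row is `mapply()`, which walks both lists in parallel. The sketch below uses a made-up table. The column names `parts_a` and `parts_b` are hypothetical stand-ins for `current_parts` and `rbt_missing_curr_parts`, and it has not been tested on the full 80,000-row data.

```r
library(data.table)

# Toy table with two list columns (values are assumptions)
results <- data.table(
  sku     = 1:3,
  parts_a = list(c(1L, 2L), 3L, c(4L, 5L)),
  parts_b = list(c(2L, 3L), 1L, 6L)
)

# One pair of vectors per row; TRUE when the two vectors share no element
no_overlap <- mapply(function(a, b) length(intersect(a, b)) == 0,
                     results$parts_a, results$parts_b)

results[no_overlap]
```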

Main code:

library(data.table)

dt <- as.data.table(A)

dt[
  ,j = `:=`(propSum = rowSums(.SD),
           current_parts = list(which(.SD > 0)),
           missing_parts = list(which(.SD == 0)))
  ,.SDcols = part_names,
  by = sku]

#could subset here as dt[1:100, ...] which would allow bigger datasets
dt_missing_parts <- dt[, .( sku, propSum, current_parts, rowTally, missing_parts, dollarsPerRobot, up_range = 1.3 *totalUnitSales, low_range = 0.7 * totalUnitSales)]

results<- dt_missing_parts[dt[rowTally <= round(cols / 2)],
                 j = .(i.sku, sku,
                       propSum = propSum + i.propSum, 
                       dollarsPerRobot = (dollarsPerRobot + i.dollarsPerRobot) / 2,
                       totalUnitSales = pmin(totalUnitSales, i.totalUnitSales),
                       rbt_missing_curr_parts = i.current_parts, 
                       current_parts,
                       rpt_missing_missing_parts= i.missing_parts,
                       missing_parts,
                       total_parts = rowTally + i.rowTally),
                 on = .(sku > sku
                        #more conditions would be great
                        # ,low_range < totalUnitSales
                        # ,up_range > totalUnitSales
                        ),
                 allow.cartesian = TRUE,
                 nomatch = 0L,
                 by = .I
                 ]
results
results[
apply(results[, .(current_parts, rbt_missing_curr_parts)], 1, function(x) length(intersect(x[[1]], x[[2]]))==0)
]
