Split a set into n unequal subsets with the key deciding factor being that the elements in the subset aggregate and equal a predetermined amount?

Problem description

I am looking at a set of numbers and aiming to split them into subsets via set partitioning. The deciding factor in how these subsets are generated is ensuring that the sum of all the elements in each subset is as close as possible to a number generated by a pre-determined distribution. The subsets need not be the same size, and each element can only be in one subset. I had previously been given guidance on this problem via the greedy algorithm (Link here), but I have found that some of the larger numbers in the set drastically skewed the results. I would therefore like to use some form of set partitioning for this problem.

A deeper underlying issue, which I would really like to correct for future problems, is that I find I am drawn to the "brute force" approach with these types of problems (as you can see from my code below, which attempts to use folds to solve the problem via "brute force"). This is obviously a completely inefficient way to tackle the problem, and so I would like to tackle these minimization-type problems with a more intelligent approach going forward. Therefore any advice is greatly appreciated.

library(groupdata2)
library(dplyr)

set.seed(345)
j <- runif(500,0,10000000)
dist <- c(.3,.2,.1,.05,.065,.185,.1)
s_diff <- 9999999999  # running best (smallest) absolute difference found so far

# Repeatedly assign the elements to random folds, keeping the assignment whose
# first fold sums closest to the first target proportion
for (i in 1:100) {
    x <- fold(j, k = length(dist), method = "n_rand")

    if (abs(sum(j) * dist[1] - sum(j[which(x$.folds==1)])) < abs(s_diff)) {
        s_diff <- abs(sum(j) * dist[1] - sum(j[which(x$.folds==1)]))
        x_fin <- x
    }
}

This is just a simplified version looking only at the first 'subset'. s_diff would be the smallest difference between the theoretical and simulated results, and x_fin would record which subset each element falls into (i.e. which fold it corresponds to). I was then looking to remove the elements that fell into the first subset and continue from there, but I know my method is inefficient.

Thanks in advance!

Solution

This is not a trivial problem, as you will probably gather from the complete lack of answers at 10 days, even with a bounty. As it happens, I think it is a great problem for thinking about algorithms and optimizations, so thanks for posting.

The first thing I would note is that you are absolutely right that this is not the kind of problem with which to try brute force. You may get near to a correct answer, but with a non-trivial number of samples and distribution points, you won't find the optimum solution. You need an iterative approach that moves elements about only if they make the fit better, and the algorithm needs to stop when it can't make it any better.

My approach here is to split the problem into three stages:

  1. Cut the data into approximately the correct bins as a first approximation
  2. Move elements from the bins that are a bit too big to the ones that are a bit too small. Do this iteratively until no more moves will optimize the bins.
  3. Swap the elements between columns to fine tune the fit, until the swaps are optimal.

The reason to do it in this order is that each step is computationally more expensive, so you want to pass a better approximation to each step before letting it do its thing.

Let's start with a function to cut the data into approximately the correct bins:

cut_elements <- function(j, dist)
{
  # Specify the sums that we want to achieve in each partition
  partition_sizes <- dist * sum(j)

  # The cumulative partition sizes give us our initial cuts
  partitions <- cut(cumsum(j), cumsum(c(0, partition_sizes)))

  # Name our partitions according to the given distribution
  levels(partitions) <- levels(cut(seq(0,1,0.001), cumsum(c(0, dist))))

  # Return our partitioned data as a data frame.
  data.frame(data = j, group = partitions)
}

We want a way to assess how close this approximation (and subsequent approximations) is to our target. We can plot against the target distribution, but it will also be helpful to have a numerical figure for the goodness of fit to include on our plot. Here, I will use the sum of the squares of the differences between the sample bins and the target bins, taking the log to make the numbers more comparable. The lower the number, the better the fit.
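
As a toy illustration of this measure (the proportions here are made up for illustration, not taken from the actual data):

estimate <- c(0.31, 0.19, 0.10, 0.05, 0.070, 0.180, 0.10)  # hypothetical fitted proportions
target   <- c(0.30, 0.20, 0.10, 0.05, 0.065, 0.185, 0.10)  # the target distribution (dist)
log(sum((estimate - target)^2))
# about -8.3; lower (more negative) numbers mean a better fit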

library(dplyr)
library(ggplot2)
library(tidyr)

compare_to_distribution <- function(df, dist, title = "Comparison")
{
  df                                             %>%
  group_by(group)                                %>%
  summarise(estimate = sum(data)/sum(j))         %>%
  mutate(group = factor(cumsum(dist)))           %>%
  mutate(target = dist)                          %>%
  pivot_longer(cols = c(estimate, target))        ->
  plot_info

  log_ss <- log(sum((plot_info$value[plot_info$name == "estimate"] -
                     plot_info$value[plot_info$name == "target"])^2))

  ggplot(data = plot_info, aes(x = group, y = value, fill = name)) +
  geom_col(position = "dodge") +
  labs(title = paste(title, ": log sum of squares =", round(log_ss, 2)))
}

So now we can do:

cut_elements(j, dist) %>% compare_to_distribution(dist, title = "Cuts only")

We can see that the fit is already pretty good with a simple cut of the data, but we can do a lot better by moving appropriately sized elements from the over-sized bins to the under-sized bins. We do this iteratively until no more moves will improve our fit. We use two nested while loops, which should make us worry about computation time, but we have started with a close match, so we shouldn't get too many moves before the loop stops:

move_elements <- function(df, dist)
{
  ignore_max = length(dist);
  while(ignore_max > 0)
  {
    ignore_min = 1
    match_found = FALSE
    while(ignore_min < ignore_max)
    {
      group_diffs   <- sort(tapply(df$data, df$group, sum) - dist*sum(df$data))
      group_diffs   <- group_diffs[ignore_min:ignore_max]
      too_big       <- which.max(group_diffs)
      too_small     <- which.min(group_diffs)
      swap_size     <- (group_diffs[too_big] - group_diffs[too_small])/2
      which_big     <- which(df$group == names(too_big))
      candidate_row <- which_big[which.min(abs(swap_size - df[which_big, 1]))]

      if(df$data[candidate_row] < 2 * swap_size)
      {
        df$group[candidate_row] <- names(too_small)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else
      {
        ignore_min <- ignore_min + 1
      }
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}

Let's see what that has done:

cut_elements(j, dist) %>% 
move_elements(dist)   %>%
compare_to_distribution(dist, title = "Cuts and moves")

You can see now that the match is so close we are struggling to see whether there is any difference between the target and the partitioned data. That's why we needed the numerical measure of GOF.

Still, let's get this fit as good as possible by swapping elements between columns to fine-tune them. This step is computationally expensive, but again we are already giving it a close approximation, so it shouldn't have much to do:

swap_elements <- function(df, dist)
{
  ignore_max = length(dist);
  while(ignore_max > 0)
  {
    ignore_min = 1
    match_found = FALSE
    while(ignore_min < ignore_max)
    {
      group_diffs    <- sort(tapply(df$data, df$group, sum)  - dist*sum(df$data))
      too_big        <- which.max(group_diffs)
      too_small      <- which.min(group_diffs)
      current_excess <- group_diffs[too_big]
      current_defic  <- group_diffs[too_small]
      current_ss     <- current_excess^2 + current_defic^2
      all_pairs      <- expand.grid(df$data[df$group == names(too_big)],
                                    df$data[df$group == names(too_small)])
      all_pairs$diff <- all_pairs[,1] - all_pairs[,2]
      all_pairs$resultant_big <- current_excess - all_pairs$diff
      all_pairs$resultant_small <- current_defic + all_pairs$diff
      all_pairs$sum_sq <- all_pairs$resultant_big^2 + all_pairs$resultant_small^2
      improvements   <- which(all_pairs$sum_sq < current_ss)
      if(length(improvements) > 0)
      {
        swap_this <- improvements[which.min(all_pairs$sum_sq[improvements])]
        r1 <- which(df$data == all_pairs[swap_this, 1] & df$group == names(too_big))[1]
        r2 <- which(df$data == all_pairs[swap_this, 2] & df$group == names(too_small))[1]
        df$group[r1] <- names(too_small)
        df$group[r2] <- names(too_big)
        ignore_max <- length(dist)
        match_found <- TRUE
        break
      }
      else ignore_min <- ignore_min + 1
    }
    if (match_found == FALSE) ignore_max <- ignore_max - 1
  }
  return(df)
}

Let's see what that has done:

cut_elements(j, dist) %>% 
move_elements(dist)   %>%
swap_elements(dist)   %>%
compare_to_distribution(dist, title = "Cuts, moves and swaps")
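
For the checks that follow, assume the result of this full pipeline has been stored in a data frame df, for example:

df <- cut_elements(j, dist) %>%
      move_elements(dist)   %>%
      swap_elements(dist)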

Pretty close to identical. Let's quantify that:

tapply(df$data, df$group, sum)/sum(j)
#     (0,0.3]    (0.3,0.5]    (0.5,0.6]   (0.6,0.65] (0.65,0.715]  (0.715,0.9] 
#  0.30000025   0.20000011   0.10000014   0.05000010   0.06499946   0.18500025 
#     (0.9,1] 
#  0.09999969

So, we have an exceptionally close match: each partition is less than one part in one million away from the target distribution. Quite impressive considering we only had 500 measurements to put into 7 bins.
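
If you want that figure directly, a quick check (assuming, as above, that df holds the final partitioned data and that the levels of df$group are in the same order as dist) is:

max(abs(tapply(df$data, df$group, sum)/sum(j) - dist))
# around 5e-07, given the proportions shown above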

In terms of retrieving your data, we haven't touched the ordering of j within the data frame df:

all(df$data == j)
# [1] TRUE

and the partitions are all contained within df$group. So if we want a single function to return just the partitions of j given dist, we can just do:

partition_to_distribution <- function(data, distribution)
{
  cut_elements(data, distribution) %>% 
  move_elements(distribution)      %>%
  swap_elements(distribution)      %>%
  `[`(,2)
}
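
As a quick usage sketch, the returned grouping can be fed to base R's split() to recover the actual subsets:

groups  <- partition_to_distribution(j, dist)
subsets <- split(j, groups)                  # list of 7 subsets of j
round(sapply(subsets, sum)/sum(j), 5)        # should reproduce dist very closely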

In conclusion, we have created an algorithm that produces an exceptionally close match. However, that's no good if it takes too long to run. Let's test it out:

microbenchmark::microbenchmark(partition_to_distribution(j, dist), times = 100)
# Unit: milliseconds
#                                expr      min       lq     mean   median       uq
#  partition_to_distribution(j, dist) 47.23613 47.56924 49.95605 47.78841 52.60657
#       max neval
#  93.00016   100

Only about 50 milliseconds to fit 500 samples, which seems good enough for most applications. Run time grows much faster than linearly with larger samples (about 10 seconds on my PC for 10,000 samples), but by that point the relative fineness of the samples means that cut_elements %>% move_elements already gives you a log sum of squares below -30, and would therefore be an exceptionally good match without the fine-tuning of swap_elements. Those two steps alone take only about 30 ms with 10,000 samples.
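
If run time ever becomes an issue for very large inputs, one possible refinement (just a sketch, with an arbitrary threshold of 5,000 samples) would be to skip the expensive swap step when the sample is large enough for cuts and moves alone to give an excellent fit:

# Sketch: skip the costly swap_elements() step for large samples, where
# cut_elements() followed by move_elements() already fits extremely well.
# The threshold of 5000 is an arbitrary choice for illustration.
partition_to_distribution_fast <- function(data, distribution, swap_threshold = 5000)
{
  df <- cut_elements(data, distribution) %>% move_elements(distribution)
  if (length(data) <= swap_threshold) df <- swap_elements(df, distribution)
  df[, 2]
}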
