Split a vector into chunks such that the sum of each chunk is approximately constant

Question

I have a large data frame with more than 100,000 records, in which the values are sorted.

For example, consider the following dummy data set:

df <- data.frame(values = c(1,1,2,2,3,4,5,6,6,7))

I want to split the values above into 3 groups (in sequence only) such that the sum of each group is more or less the same.

So for the data above, if I divide the sorted df into 3 groups as follows, their sums will be:

1. 1 + 1 + 2 + 2 + 3 + 4 = 13
2. 5 + 6 = 11
3. 6 + 7 = 13

How can I do this optimization in R? Is there any logic for it?

Recommended answer

So, let's use pruning. I think the other solutions give a good solution, but not the best one.

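First, we want to minimize

$$D(p, q) = (S_p - S^*)^2 + (S_q - S_p - S^*)^2 + (S_n - S_q - S^*)^2$$

where S_k is the cumulative sum of the first k elements, n is the length of the vector, S^* = S_n / 3 is the target sum of each group, and p < q are the positions of the two cuts.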

computeD <- function(p, q, S) {
  n <- length(S)
  S.star <- S[n] / 3
  if (all(p < q)) {
    (S[p] - S.star)^2 + (S[q] - S[p] - S.star)^2 + (S[n] - S[q] - S.star)^2
  } else {
    stop("You shouldn't be here!")
  }
}
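As a quick check of the objective that computeD evaluates, the cost of any set of cut points can be computed directly from the cumulative sums. The sketch below uses a hypothetical helper, chunk_cost (not part of the original answer), and applies it to the grouping proposed in the question. It assumes cuts holds the last index of every chunk except the final one.

```r
# Hypothetical helper, not from the original answer: chunk sums and
# squared-deviation cost for an arbitrary vector of cut positions.
chunk_cost <- function(v, cuts) {
  S <- cumsum(v)
  n <- length(v)
  K <- length(cuts) + 1                 # number of chunks
  sums <- diff(c(0, S[c(cuts, n)]))     # sum of each contiguous chunk
  list(sums = sums, cost = sum((sums - S[n] / K)^2))
}

values <- c(1, 1, 2, 2, 3, 4, 5, 6, 6, 7)
res <- chunk_cost(values, cuts = c(6, 8))
print(res$sums)   # 13 11 13
print(res$cost)   # 8/3, about 2.667
```

For two cuts, the cost returned here is the same quantity as computeD(6, 8, cumsum(values)).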

I think the other solutions optimize over p and q independently, which won't give a global minimum (except in some particular cases).

optiCut <- function(v) {
  S <- cumsum(v)
  n <- length(v)
  S_star <- S[n] / 3
  # good starting values: cuts closest to 1/3 and 2/3 of the total
  p_star <- which.min((S - S_star)^2)
  q_star <- which.min((S - 2*S_star)^2)
  print(min <- computeD(p_star, q_star, S))

  count <- 0  # number of q values that survive the pruning test
  for (q in 2:(n-1)) {
    S3 <- S[n] - S[q] - S_star
    # only search over p if the third term alone is below the best value so far
    if (S3*S3 < min) {
      count <- count + 1
      D <- computeD(seq_len(q - 1), q, S)
      ind <- which.min(D)
      if (D[ind] < min) {
        # Update optimal values
        p_star <- ind
        q_star <- q
        min <- D[ind]
      }
    }
  }
  c(p_star, q_star, computeD(p_star, q_star, S), count)
}

This is as fast as the other solutions because the condition S3*S3 < min prunes many of the iterations. Unlike them, however, it gives the optimal solution; see optiCut(c(1, 2, 3, 3, 5, 10)).
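On an input this small, the claim can be checked by exhaustive search. The sketch below is a brute-force comparison written for illustration (the names cost, greedy, and best are mine, not from the answer). It evaluates the same objective for every valid pair p < q and compares the minimum with the value at the independently chosen starting cuts.

```r
v <- c(1, 2, 3, 3, 5, 10)
S <- cumsum(v)
n <- length(v)
S_star <- S[n] / 3

# Same three-term objective as computeD, vectorized over p and q.
cost <- function(p, q) {
  (S[p] - S_star)^2 + (S[q] - S[p] - S_star)^2 + (S[n] - S[q] - S_star)^2
}

# Independently chosen cuts (the starting values used by optiCut).
p_g <- which.min((S - S_star)^2)
q_g <- which.min((S - 2 * S_star)^2)
greedy <- cost(p_g, q_g)

# Exhaustive search over all pairs p < q <= n - 1.
pairs <- t(combn(n - 1, 2))
costs <- cost(pairs[, 1], pairs[, 2])
best <- pairs[which.min(costs), ]

print(c(p_g, q_g, greedy))   # 4 5 14
print(c(best, min(costs)))   # 3 5 8
```

Here the starting cuts (4, 5) cost 14, while the exhaustive minimum is 8 at cuts (3, 5), giving group sums 6, 8, and 10.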

For the general solution with K >= 3 groups, I basically reimplemented trees with nested tibbles, which was fun!

optiCut_K <- function(v, K) {

  S <- cumsum(v)
  n <- length(v)
  S_star <- S[n] / K
  # good starting values
  p_vec_first <- sapply(seq_len(K - 1), function(i) which.min((S - i*S_star)^2))
  min_first <- sum((diff(c(0, S[c(p_vec_first, n)])) - S_star)^2)

  compute_children <- function(level, ind, val) {

    # leaf
    if (level == 1) {
      val <- val + (S[ind] - S_star)^2
      if (val > min_first) {
        return(NULL)
      } else {
        return(val)
      } 
    } 

    P_all <- val + (S[ind] - S[seq_len(ind - 1)] - S_star)^2
    inds <- which(P_all < min_first)
    if (length(inds) == 0) return(NULL)

    node <- tibble::tibble(
      level = level - 1,
      ind = inds,
      val = P_all[inds]
    )
    node$children <- purrr::pmap(node, compute_children)

    node <- dplyr::filter(node, !purrr::map_lgl(children, is.null))
    `if`(nrow(node) == 0, NULL, node)
  }

  compute_children(K, n, 0)
}

This gives you all the solutions that are at least as good as the greedy one:

v <- sort(sample(1:1000, 1e5, replace = TRUE))
test <- optiCut_K(v, 9)

You need to unnest it:

full_unnest <- function(tbl) {
  tmp <- try(tidyr::unnest(tbl), silent = TRUE)
  `if`(identical(class(tmp), "try-error"), tbl, full_unnest(tmp))
}
print(test <- full_unnest(test))

And finally, to get the best solution:

test[which.min(test$children), ]
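Once cut positions are known, they can be turned into a grouping column on the original data frame. This is a minimal base-R sketch, assuming the cuts are stored as the last index of each chunk except the final one. The variable names cuts, sizes, and group_sums are illustrative, and the cuts shown are the ones from the question, not output of the code above.

```r
df <- data.frame(values = c(1, 1, 2, 2, 3, 4, 5, 6, 6, 7))

# Assumed cut positions: last row index of each chunk except the final one.
cuts <- c(6, 8)

# Chunk lengths, then one group label per row.
sizes <- diff(c(0, cuts, nrow(df)))
df$group <- rep(seq_along(sizes), times = sizes)

# Sum of the values within each group.
group_sums <- tapply(df$values, df$group, sum)
print(group_sums)   # groups 1, 2, 3 sum to 13, 11, 13
```

Because the groups are contiguous, rep() on the chunk lengths is enough and no join or lookup is needed.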
