Can brute force algorithms scale?


Question


I have a math problem that I solve by trial and error (I think this is called brute force), and the program works fine when there are a few options, but as I add more variables/data it takes longer and longer to run.

My problem is that although the prototype works, it needs to be useful with thousands of variables and large data sets; so I'm wondering whether it is possible to scale brute force algorithms. How can I approach scaling this?

I was starting to learn and play around with Hadoop (and HBase); although it looks promising, I wanted to verify that what I'm trying to do isn't impossible.

If it helps, I wrote the program in Java (and can use it if that's possible), but I ended up porting it to Python because I feel more comfortable with it.

Update: To provide more insight, I think I'll add a simplified version of the code to get the idea across. Basically, if I know the sum is 100, I am trying to find all combinations of the variables that could equal it. This is simple; in my real version I may use larger numbers and many more variables. It's a Diophantine equation, and I believe no algorithm exists that solves it without brute force.

int sum = 100;
int a1 = 20;
int a2 = 5;
int a3 = 10;
for (int i = 0; i * a1 <= sum; i++) {
    for (int j = 0; i * a1 + j * a2 <= sum; j++) {
        for (int k = 0; i * a1 + j * a2 + k * a3 <= sum; k++) {
            if (i * a1 + j * a2 + k * a3 == sum) {
              System.out.println(i + "," + j + "," + k);
            }
        }
    }   
}

I am new to programming, and I am sorry if I'm not framing this question correctly. This is more of a general question.

Solution

Typically, you can quantify how well an algorithm will scale by using big-O notation to analyze its growth rate. When you say that your algorithm works by "brute force," it's unclear to what extent it will scale. If your "brute force" solution works by listing all possible subsets or combinations of a set of data, then it almost certainly will not scale (it will have asymptotic complexity O(2^n) or O(n!), respectively). If your brute force solution works by finding all pairs of elements and checking each, it may scale reasonably well (O(n^2)). Without more information about how your algorithm works, though, it's difficult to say.

You may want to look at this excellent post about big-O as a starting point for how to reason about the long-term scalability of your program. Typically speaking, anything with growth rate O(n log n), O(n), O(log n), or O(1) scales extremely well, anything with growth rate O(n^2) or O(n^3) will scale up to a point, and anything with growth rate O(2^n) or higher will not scale at all.

Another option would be to look up the problem you're trying to solve to see how well-studied it is. Some problems are known to have great solutions, and if yours is one of them it might be worth seeing what others have come up with. Perhaps there is a very clean, non-brute-force solution that scales really well! Some other problems are conjectured to have no scalable algorithms at all (the so-called NP-hard problems). If that's the case, then you should be pretty confident that there's no way to get a scalable approach.

And finally, you can always ask a new question here at Stack Overflow describing what you're trying to do and asking for input. Maybe the community can help you solve your problem more efficiently than you initially expected!

EDIT: Given the description of the problem that you are trying to solve, right now you are doing one for loop per variable from 0 up to the number you're trying to target. The complexity of this algorithm is O(U^k), where k is the number of variables and U is the sum. This approach will not scale very well at all. Introducing each new variable in the above case will make the algorithm run 100 times slower, which definitely will not scale very well if you want 100 variables!
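
To make the O(U^k) growth concrete, here is a sketch (my own generalization, not the asker's code) that replaces the hard-coded nested loops with one recursive level per variable. With k variables, each level can iterate up to roughly U times, which is exactly where the exponent comes from:

```java
import java.util.Arrays;

public class BruteForce {
    // Try every coefficient for the variable at `index`, recursing into the
    // remaining variables; returns the number of solutions printed. One
    // recursion level replaces one hard-coded loop from the question.
    static int search(int[] vars, int[] coeffs, int index, int remaining) {
        if (index == vars.length) {
            if (remaining == 0) {                    // exact sum reached
                System.out.println(Arrays.toString(coeffs));
                return 1;
            }
            return 0;
        }
        int found = 0;
        for (int c = 0; c * vars[index] <= remaining; c++) {
            coeffs[index] = c;
            found += search(vars, coeffs, index + 1, remaining - c * vars[index]);
        }
        return found;
    }

    public static void main(String[] args) {
        int[] vars = {20, 5, 10};   // a1, a2, a3 from the question
        int count = search(vars, new int[vars.length], 0, 100);
        System.out.println(count + " solutions");
    }
}
```

Adding a fourth variable adds a fourth level of up-to-U iterations, which is why each new variable multiplies the running time.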

However, I think that there is a fairly good algorithm whose runtime is O(U^2 k) and that uses O(U k) memory to solve the problem. The intuition is as follows: Suppose that we want to sum up 1, 2, and 4 to get 10. There are many ways to do this:

2 * 4 +  1 * 2 +  0 * 1
2 * 4 +  0 * 2 +  2 * 1
1 * 4 +  3 * 2 +  0 * 1
1 * 4 +  2 * 2 +  2 * 1
1 * 4 +  1 * 2 +  4 * 1
1 * 4 +  0 * 2 +  6 * 1
0 * 4 +  5 * 2 +  0 * 1
0 * 4 +  4 * 2 +  2 * 1
0 * 4 +  3 * 2 +  4 * 1
0 * 4 +  2 * 2 +  6 * 1
0 * 4 +  1 * 2 +  8 * 1
0 * 4 +  0 * 2 + 10 * 1

The key observation is that we can write all of these out as sums, but more importantly, as sums where each term in the sum is no greater than the previous term:

2 * 4 +  1 * 2 +  0 * 1 = 4 + 4 + 2
2 * 4 +  0 * 2 +  2 * 1 = 4 + 4 + 1 + 1
1 * 4 +  3 * 2 +  0 * 1 = 4 + 2 + 2 + 2
1 * 4 +  2 * 2 +  2 * 1 = 4 + 2 + 2 + 1 + 1
1 * 4 +  1 * 2 +  4 * 1 = 4 + 2 + 1 + 1 + 1 + 1
1 * 4 +  0 * 2 +  6 * 1 = 4 + 1 + 1 + 1 + 1 + 1 + 1
0 * 4 +  5 * 2 +  0 * 1 = 2 + 2 + 2 + 2 + 2
0 * 4 +  4 * 2 +  2 * 1 = 2 + 2 + 2 + 2 + 1 + 1
0 * 4 +  3 * 2 +  4 * 1 = 2 + 2 + 2 + 1 + 1 + 1 + 1
0 * 4 +  2 * 2 +  6 * 1 = 2 + 2 + 1 + 1 + 1 + 1 + 1 + 1
0 * 4 +  1 * 2 +  8 * 1 = 2 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1
0 * 4 +  0 * 2 + 10 * 1 = 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1

So this gives an interesting idea about how to generate all possible ways to sum up to the target. The idea is to fix the first coefficient, then generate all possible ways to make the rest of the sum work out. In other words, we can think about the problem recursively. If we list the variables in order as x_1, x_2, ..., x_n, then we can try fixing some particular coefficient c_1 for x_1, then solving the problem of summing up to sum - c_1 x_1 using just x_2, ..., x_n.

So far this doesn't seem all that fancy - in fact, it's precisely what you're doing above - but there is one trick we can use. As long as we're going to be thinking about this problem recursively, let's think about the problem in the opposite manner. Rather than starting with sum and trying to break it down, what if instead we started with 0 and tried to build up everything that we could?

Here's the idea. Suppose that we already know in advance all the numbers we can make using just copies of x_1. Then for every number k between 0 and sum, inclusive, we can make k out of x_1 and x_2 using any combination where k - c_2 x_2 is something that can be made out of copies of x_1. But since we've precomputed this, we can just iterate over all possible legal values of c_2, compute k - c_2 x_2, and see if we know how to make it. Assuming we store a giant (U + 1) x (k + 1) table of boolean values such that table entry [x, y] stores "can we use the first y variables, in some combination, to sum up to precisely x?", we can fill in the table efficiently. This is called dynamic programming and is a powerful algorithmic tool.

More concretely, here's how this might work. Given k variables, create a (U + 1) x (k + 1) table T of values. Then, set T[0][0] = true and T[x][0] = false for all x > 0. The rationale here is that T[0][0] means "can we get the sum zero using a linear combination of the first zero variables?" and the answer is definitely yes (the empty sum is zero!), but for any other sum we definitely cannot make it out of a linear combination of no variables.

Now, for i = 1 .. k, we'll try to fill in the values of T[x][i]. Remember that T[x][i] means "can we make x as a linear combination of the first i variables?" Well, we know that we can do this if there is some coefficient c such that x - c x_i can be made using a linear combination of x_1, x_2, ..., x_{i - 1}. But for any c, that's just whether T[x - c x_i][i - 1] is true. Thus we can say

for i = 1 to k:
    for z = 0 to sum:
        for c = 0 to z / x_i:
            if T[z - c * x_i][i - 1] is true:
                set T[z][i] to true
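
As a sketch, the table-filling pseudocode above might look like this in Java (the class and method names are hypothetical, and the variable values are the ones from the question):

```java
public class MakeTable {
    // Builds the (sum + 1) x (k + 1) table from the pseudocode:
    // T[z][i] answers "can z be made as a nonnegative linear combination
    // of the first i variables?"
    static boolean[][] buildTable(int[] x, int sum) {
        boolean[][] T = new boolean[sum + 1][x.length + 1];
        T[0][0] = true;   // base case: the empty sum makes 0
        for (int i = 1; i <= x.length; i++) {
            for (int z = 0; z <= sum; z++) {
                // c = 0 carries forward what the first i - 1 variables could
                // already make; c >= 1 adds c copies of x[i - 1].
                for (int c = 0; c * x[i - 1] <= z; c++) {
                    if (T[z - c * x[i - 1]][i - 1]) {
                        T[z][i] = true;
                        break;
                    }
                }
            }
        }
        return T;
    }

    public static void main(String[] args) {
        boolean[][] T = buildTable(new int[]{20, 5, 10}, 100);
        System.out.println(T[100][3]);   // can 20, 5, 10 make exactly 100?
    }
}
```

Note that the c = 0 case is what lets a sum reachable with i - 1 variables stay reachable with i variables.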

Inspecting the loops, we see that the outer loop runs k times, the middle loop runs sum times per outer iteration, and the innermost loop also runs at most sum times per iteration. Their product is (using our notation from above) O(U^2 k), which is way better than the O(U^k) algorithm that you had originally.

But how do you use this information to list off all of the possible ways to sum up to the target? The trick here is to realize that you can use the table to avoid wasting a huge amount of effort searching over every possible combination when many of them aren't going to work.

Let's see an example. Suppose that we have this table completely computed and want to list off all solutions. One idea is to think about listing all solutions where the coefficient of the last variable is zero, then when the last variable is one, etc. The issue with the approach you had before is that for some coefficients there might not be any solutions at all. But with the table we have constructed above, we can prune out those branches. For example, suppose that we want to see if there are any solutions that start with x_k having coefficient 0. This means that we're asking if there are any ways to sum up a linear combination of the first k - 1 variables so that the sum of those values is sum. This is possible if and only if T[sum][k - 1] is true. If it is true, then we can recursively try assigning coefficients to the rest of the values in a way that sums up to sum. If not, then we skip this coefficient and go on to the next.

Recursively, this looks something like this:

function RecursivelyListAllThatWork(k, sum) // Using last k variables, make sum
    /* Base case: If we've assigned all the variables correctly, list this
     * solution.
     */
    if k == 0:
        print what we have so far
        return

    /* Recursive step: Try all coefficients, but only if they work. */
    for c = 0 to sum / x_k:
       if T[sum - c * x_k][k - 1] is true:
           mark the coefficient of x_k to be c
           call RecursivelyListAllThatWork(k - 1, sum - c * x_k)
           unmark the coefficient of x_k
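
Putting the two pieces of pseudocode together, a self-contained Java sketch might look like the following (the names are mine, and this is one possible rendering rather than a definitive implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListAllThatWork {
    // T[z][i] == true means z is reachable as a nonnegative linear
    // combination of the first i variables (the dynamic programming table).
    static boolean[][] buildTable(int[] x, int sum) {
        boolean[][] T = new boolean[sum + 1][x.length + 1];
        T[0][0] = true;                       // the empty sum makes 0
        for (int i = 1; i <= x.length; i++)
            for (int z = 0; z <= sum; z++)
                for (int c = 0; c * x[i - 1] <= z && !T[z][i]; c++)
                    T[z][i] = T[z - c * x[i - 1]][i - 1];
        return T;
    }

    // Assign a coefficient to variable k, recursing over the remaining
    // k - 1 variables, but only down branches the table says can work.
    static void recurse(int[] x, boolean[][] T, int k, int sum,
                        int[] coeffs, List<int[]> out) {
        if (k == 0) {                         // base case: all assigned
            out.add(coeffs.clone());
            return;
        }
        for (int c = 0; c * x[k - 1] <= sum; c++) {
            if (T[sum - c * x[k - 1]][k - 1]) {   // prune dead branches
                coeffs[k - 1] = c;                // "mark" the coefficient
                recurse(x, T, k - 1, sum - c * x[k - 1], coeffs, out);
                coeffs[k - 1] = 0;                // "unmark" it again
            }
        }
    }

    static List<int[]> listAll(int[] x, int sum) {
        List<int[]> out = new ArrayList<>();
        recurse(x, buildTable(x, sum), x.length, sum, new int[x.length], out);
        return out;
    }

    public static void main(String[] args) {
        for (int[] sol : listAll(new int[]{20, 5, 10}, 100))
            System.out.println(Arrays.toString(sol));
    }
}
```

On the question's inputs (20, 5, 10 summing to 100) this enumerates exactly the same coefficient triples as the original nested loops, just without visiting any dead branches.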

This will recursively list all the solutions that work, using the values in the table we just constructed to skip a huge amount of wasted effort. Once you've built the table, you could divvy this work up by farming the task out to multiple computers, having each of them list a subset of the total solutions and processing them all in parallel.

Hope this helps!
