有没有好的办法做到这一点类型的挖掘? [英] Is there a good way to do this type of mining?

查看:154
本文介绍了有没有好的办法做到这一点类型的挖掘?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图找到最接近在X和Y方向(在最后给定的样本数据集)的空间,我想看看是否有更聪明的办法比我微不足道的(和未经测试)的方法来做到这点。这些点在空间中的情节看起来像下面这样,我试图找到套标志着内部的箱点,即我要找的输出是一组集:

 第1组:(1,23),(2,23),(3,23)...
第2组:(68200),(68201),(68203),(68204),(68100),(68101),(68101)...
 

有关横带,我想我可以继续使用尺寸小推拉窗说,5年或10(它确实应该从全球的信息,其中规模将给出最大的分组点,但我很确定还在探索一种很好的方法)并搜索相邻的点,因为休息就不会再被认为是水平带。

我猜测同样的方法适用于垂直带的很好,但不是在所有情况下,因为在水平和垂直带一个微妙的差异:点应该出现接近有一批水平考虑,但它们可以出现在任何地方是的垂直带深思熟虑的一部分。观察大型立式带所示。所以我猜我可能只是寻找那些具有相同的x坐标点(在这种情况下,x = 68)应该给我一分不少。

除了这个平凡的解决方案,我想不出什么聪明的,可以在这里完成,因为这问题似乎看似简单的给我。我失去了一些东西呢?这是否落入一些已知类的问题,如果是这样,有一个很好的和可扩展的方式来实现这一目标?

样本数据集:

  1,23
1,23
2,23
3,23
4,23
5,23
6,23
7,23
8,23
9,23
10,23
11,23
12,23
13,23
14,23
15,23
16,23
10,33
11,33
12,33
13,33
14,33
15,33
16,33
17,33
18,33
19,33
2,28
2,28
3,28
34,75
34,76
34,76
34,77
34,78
34,79
34,80
34,81
34,82
34,83
34,75
34,76
34,76
34,77
34,78
34,79
34,80
34,81
400,28
400,28
400,28
68,200
68201
68203
68204
68,100
68101
68103
68104
 

解决方案

这是晚了一点,但这个问题一直担心我一段时间。一世 为确保它可以与混合整数/线性编程技术解决了 并要求在这个问题上的帮助:标识列和行集群线性规划

然而,得到的答复后有,我有这样的认识你的问题,在 至少根据我的理解,就是这么简单(作为一个约束程序框架) 你可以平凡与简单的程序解决这个问题(你已经 知道)。换句话说,约束规划将是一个很酷的方式来解决 这一点,但是,至少我找到了办法,会给你同样的答案 作为东西要简单得多。

我将我的推理,下面就介绍一下我怎么会用约束实现它 解决包,然后给最终的,琐碎的,算法。

混合整数规划解

的最重要的细节是水平和垂直之间的差 组。至于我可以看到,任何垂直对齐可以在 相同的基团。但水平群体是不同的 - 组件已接近 在一起。

与约束解决问题中最难的部分似乎找到 的方式来描述,该解算器可以理解的方式的限制。我不会 进入这里的细节,但是解算器是令人沮丧的限制。幸运的是我 认为有一种方法可以在这里做到这一点,是要考虑水平 邻居:如果有N个一排,然后我们有 N-1 套 邻居(例如,用4个点的ABC和D有三个双 AB,BC,和CD)。

有关每一对中,我们可以给出分数,这是它们之间的空格数 ( S_I )的一些因素 K ,以及一个标志缩放( F_i ),它是0或1。如果 对处于同一水平组,然后我们设置标志设置为1,否则 是零。

要看到的一组标志为所有对的这是至关重要的完全 定义了一个解决方案的。我们可以在任何行运行,把对具有标志 1在同一水平组,并启动一个新的水平各组 时间标志为0。然后,我们可以采取大小为1的所有水平组和 它们转换成垂直团:任意点不是在一个水平基 必须是在一个垂直组(即使只是一个的竖直基)。

所以,现在我们需要的是一种方法,当然preSS的最佳解决方案中的条款 标志。我建议,我们希望尽量减少:

 和(1  -  F_i)+ SUM(K * S_I * F_i)
 

这有两个方面。第一个是1减去标记的每个的总和 对。该标志为1时的点是在同一水平组和0 除此以外。因此,尽量减少此值等于说,我们要为 几个的水平群成为可能。如果这是唯一的约束,则我们 它通过可设置为零的所有 F_i 1 - 通过对行的所有对 同一组的成员。

但第二项阻止我们选择这样一个极端的解决方案。它 惩罚与群体差距。如果一对在同一个组,但 通过 S_I 空格分隔,那么我们有 K * S_I

一个罚

因此​​,我们有一个权衡。我们希望水平组,但我们不希望空白。 最终的解决方案将取决于 K - 如果是大的话,我们将不包括 在水平组的任何空间。但是,因为它降低,我们将开始做 所以,直到当它是非常小的(趋于零)我们把一切在一排 在一个组中。

要使用这一点,你会选择一些 K ,计算 S_I ,并进入前pression入一个约束机制。该系统将然后选择 F_i 来减少前pression。最后,你会转换 F_i 成组的方式通过扫描如上所述每一行,然后垂直分组单身。

解析解

确定,冬暖夏凉。在这一点上,我们有办法EX preSS的问题​​,我们可以给 一个约束引擎。

不过是微不足道的解决!我们不需要任何讨厌的约束引擎 解决这个问题 - 我们只要看一下前pression:

 和(1  -  F_i)+ SUM(K * S_I * F_i)
 

这两个数额是在同一对,所以我们可以将所有的东西都和:

 和(1  -  F_i + K * S_I * F_i)
总和(1 + F_i *(K * S_I  -  1))
 

然后提取常数( N 这里是对的总数):

  N + SUM(F_i *(K * S_I  -  1))
 

现在注意,在每一项都是独立的(和添加剂)。因此,对于每一个 短期,我们希望的最小值。我们有两个选择:

  • 如果 F_i 0,那么整个期限为0。

  • 否则, F_i 为1,期限 K * S_I - 1

所以最好的选择取决于 K * S_I 是否大于1。如果 K * S_I 大于1,则该术语的最小值为0,而 F_i 应该是0,否则第二志愿以上是负的,而 F_i 应 是一个。

琐碎的算法

这是什么意思?这意味着,对于每一对,我们可以简单地看一下 空格数, S_I 。如果大于 1 / K 那么这两个点 应该是在单独的组。否则,它们应该在相同的基团

所以,这一切花哨的数学和优化,约束和bullshitting 归结为:相距多远都在邻国对两个点?如果他们 比一些停产近​​,把它们放在同一水平组。 否则,把它们放在不同的组。

所以在这里,最后,是你的算法:

 选择一些临界值,X
将每个点在它自己的,单身,水平组
对于每一行与一个以上的点:
    对于每个相邻一对行中:
        如果该对之间的间隔小于X:
            加入到一个单一的水平组
为每列:
    加入任何剩余的单组到一个单一的垂直组
 

结论

  • 可以使用约束编程技术来解决这个问题,但这种技术仅限于描述该系统中的正确的解决方案(通常为直链)的方式

  • 最简单的这样的方法我能找到相当于一个简单的,直接的 算法分点成一排为根据水平群体 它们之间的空格数。

  • 这一切都取决于一大堆假设你想要的东西这可能,当然, 过分简化,或只是简单的错误。

I am trying to find points that are closest in space in X and Y directions (sample dataset given at the end) and am looking to see if there are smarter approaches to do this than my trivial (and untested) approach. The plot of these points in space looks something like the following and am trying to find sets of points marked inside the boxes i.e. the output I am looking for is a set of groups:

Group 1: (1,23), (2,23), (3,23)...
Group 2: (68,200), (68,201), (68,203), (68,204), (68,100), (68,101), (68,101)...

For the horizontal bands, I am thinking I could just go ahead and use small sliding windows of size say, 5 or 10 (which should really be determined from the global information of which size will give the maximum grouped points but I am still exploring a good approach) and search for contiguous points because a break would not be considered a horizontal band anymore.

I am guessing the same approach works for the vertical bands as well but not in all cases because there is a subtle difference in horizontal and vertical bands: points should appear close to be considered a group horizontally but they can appear anywhere to be considered part of a vertical band. Observe the large vertical band in the figure. So I am guessing I could just look for points that have the same x-coordinate (in this case, x=68) should give me a lot of points.

Other than this trivial solution, I can't think of anything smart that can be done here as this problem appears deceptively simple to me. Am I missing something here? Does this fall into some known class of problems and if so, is there a good and scalable approach to achieve this?

Sample Dataset:

1,23
1,23
2,23
3,23
4,23
5,23
6,23
7,23
8,23
9,23
10,23
11,23
12,23
13,23
14,23
15,23
16,23
10,33
11,33
12,33
13,33
14,33
15,33
16,33
17,33
18,33
19,33
2,28
2,28
3,28
34,75
34,76
34,76
34,77
34,78
34,79
34,80
34,81
34,82
34,83
34,75
34,76
34,76
34,77
34,78
34,79
34,80
34,81
400,28
400,28
400,28
68,200
68,201
68,203
68,204
68,100
68,101
68,103
68,104

解决方案

This is a little late, but this problem has been worrying me for some time. I was sure it could be solved with mixed integer / linear programming techniques and asked for help in this question: Identifying column and row clusters with linear programming

However, after getting a reply there, I had the insight that your problem, at least as I understand it, is so simple (when framed as a constraint program) that you can solve it trivially with a simple program (which you already knew). In other words, constraint programming would be a cool way to solve this, but, at least with the approach I found, would give you the same answer as something much simpler.

I'll explain below my reasoning, how I would implement it with a constraint solving package, and then give the final, trivial, algorithm.

Mixed integer programming solution

The most important detail is the difference between horizontal and vertical groups. As far as i can see, anything that aligns vertically can be in the same group. But horizontal groups are different - components have to be close together.

The hardest part of solving a problem with constraints seems to be finding a way to describe the limits in a way that the solver can understand. I won't go into the details here, but solvers are frustratingly limited. Luckily I think there is a way to do this here, and it is to consider horizontal neighbours: if there are N points in a row then we have N-1 sets of neighbours (for example, with 4 points A B C and D there are the three pairs AB, BC, and CD).

For each pair, we can give a score, which is the number of spaces between them (S_i) scaled by some factor K, and a flag (F_i) which is 0 or 1. If the pair are in the same horizontal group then we set the flag to 1, otherwise it is zero.

It is critical to see that the set of flags for all the pairs completely defines a solution. We can run across any row, placing pairs with a flag of 1 in the same horizontal group, and starting a new horizontal group each time the flag is 0. Then, we can take all horizontal groups of size 1 and convert them into vertical groups: any point that is not in a horizontal group must be in a vertical group (even if it is a vertical group of just one).

So all we need now is a way to express an optimal solution in terms of the flags. I suggest that we want to minimise:

sum(1 - F_i) + sum(K * S_i * F_i)

This has two terms. The first is the sum of "one minus the flag" for each pair. The flag is 1 when the points are in the same horizontal group and 0 otherwise. So minimising this value is the same as saying that we want as few horizontal groups as possible. If this was the only constraint then we could set it to zero by making all the F_i 1 - by making all pairs on a row members of the same group.

But the second term stops us from choosing such an extreme solution. It penalises groups with gaps. If a pair are in the same group, but are separated by S_i spaces, then we have a "penalty" of K * S_i.

So we have a trade-off. We want horizontal groups, but we don't want gaps. The final solution will depend on K - if it is large then we won't include any spaces in horizontal groups. But as it is decreased we will start to do so, until when it is very small (tends to zero) we place everything in a row in a single group.

To use this you would choose some K, calculate the S_i, and enter the expression above into a constraint system. The system would then choose F_i to minimise the expression. Finally you would convert the F_i into a pattern of groups by scanning each row as described above and then grouping singletons vertically.

Analytic solution

OK, cool. At this point we have a way to express the problem that we can give to a constraint engine.

But it's trivial to solve! We don't need no stinkin' constraint engine to solve this - we can just look at the expression:

sum(1 - F_i) + sum(K * S_i * F_i)

The two sums are over the same pairs, so we can move everything into the sum:

sum(1 - F_i + K * S_i * F_i)
sum(1 + F_i * (K * S_i - 1))

And then extract the constant (N here is the total number of pairs):

N + sum(F_i * (K * S_i - 1))

Now note that each term in the sum is independent (and additive). So for each term, we want the minimum value. We have two options:

  • if F_i is 0 then the entire term is 0.

  • otherwise, F_i is 1 and the term is K * S_i - 1.

So the best choice depends on whether K * S_i is greater than 1. If K * S_i is greater than 1 then the smallest value of the term is 0, and F_i should be 0. Otherwise the second choice above is negative, and F_i should be one.

Trivial algorithm

What does this mean? It means that for each pair we can simply look at the number of spaces, S_i. If that is greater than 1 / K then the two points should be in separate groups. Otherwise they should be in the same group.

So all this fancy maths and optimisation and constraints and bullshitting comes down to: how far apart are two points in neighbouring pairs? If they are closer than some cut-off, put them in the same horizontal group. Otherwise, put them in separate groups.

So here, finally, is your algorithm:

choose some cut-off value, X
place each point in its own, singleton, horizontal group
for each row with more than one point:
    for each neighbouring pair in the row:
        if the space between the pair is less than X:
            join into a single horizontal group
for each column:
    join any remaining singleton groups into a single vertical group

Conclusion

  • You can use constraint programming techniques to solve this problem, but such techniques are restricted to solutions that describe the system in the "correct" (typically, linear) way.

  • The simplest such approach I can find is equivalent to a trivial, direct algorithm that divides points in a row into horizontal groups depending on the number of spaces between them.

  • This all depends on a whole pile of assumptions about what you wanted which may, of course, be over-simplifications, or just plain wrong.

这篇关于有没有好的办法做到这一点类型的挖掘?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆