Number clustering/partitioning algorithm

Problem description

I have an ordered 1-D array of numbers. Both the array length and the values of the numbers in the array are arbitrary. I want to partition the array into k partitions, according to the number values, e.g. let's say I want 4 partitions, distributed as 30% / 30% / 20% / 20%, i.e. the top 30% values first, the next 30% afterwards, etc. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it should not be contained in two different partitions. This means that the distribution percentages above are not strict, but rather the "goals" or "starting points" if you wish.

For example, let's say my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8].

I choose k = 4 and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.
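To make the constraints concrete, here is a minimal sketch that converts the chosen percentages into ideal part sizes and lists the positions where the array may be cut without splitting a run of equal values. The helper names `target_sizes` and `allowed_cuts` are illustrative assumptions, not part of the original question.

```python
def target_sizes(n, percentages):
    """Convert percentages (summing to 100) into ideal, possibly fractional, part sizes."""
    return tuple(n * p / 100.0 for p in percentages)


def allowed_cuts(a):
    """Return the indices i where cutting into a[:i] and a[i:] does not split a run of equal values."""
    return [i for i in range(1, len(a)) if a[i - 1] != a[i]]


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
print(target_sizes(len(ar), (25, 25, 25, 25)))  # (2.5, 2.5, 2.5, 2.5)
print(allowed_cuts(ar))                         # [1, 3, 4, 5]
```

Any valid partition can only place its boundaries at the indices returned by `allowed_cuts`, which is why the ideal sizes of 2.5 cannot be met exactly here.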

Given the constraints I gave above, the resulting partitions should be:

A = [1]
B = [5, 5]
C = [6, 7]
D = [8, 8, 8, 8, 8]

with resulting (achieved/corrected) percentages pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%
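The achieved percentages can be recomputed from the partitions themselves. The following is a small sketch; the function names `achieved_percentages` and `values_are_disjoint` are hypothetical and only serve the illustration.

```python
def achieved_percentages(parts):
    """Return the share of all elements that ends up in each part, in percent."""
    total = sum(len(p) for p in parts)
    return [100.0 * len(p) / total for p in parts]


def values_are_disjoint(parts):
    """Check that no value occurs in more than one part."""
    seen = set()
    for p in parts:
        values = set(p)
        if seen & values:
            return False
        seen |= values
    return True


parts = [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]
print(achieved_percentages(parts))  # [10.0, 20.0, 20.0, 50.0]
print(values_are_disjoint(parts))   # True
```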

It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.

So, is there an algorithm for this kind of clustering?

Recommended answer

Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.

def partitions(a, i, sizes, cache):
    """Find a least-cost partition of a[i:].

    The ideal sizes of the partitions are stored in the tuple 'sizes'
    and cache is used to memoize previously calculated results.
    Returns a tuple (cost, part_sizes).
    """
    key = (i, sizes)
    if key in cache:
        return cache[key]
    if len(sizes) == 1:
        # The last part takes everything that is left.
        segment = len(a) - i
        result = (segment - sizes[0]) ** 2, (segment,)
        cache[key] = result
        return result
    best_cost, best_partition = None, None
    # Try every possible size j for the first part (0 means an empty part).
    for j in range(len(a) - i + 1):
        if 0 < j < len(a) - i and a[i + j - 1] == a[i + j]:
            # Avoid breaking a run of one number.
            continue
        bc, bp = partitions(a, i + j, sizes[1:], cache)
        c = (j - sizes[0]) ** 2 + bc
        if best_cost is None or c < best_cost:
            best_cost = c
            best_partition = (j,) + bp
    cache[key] = (best_cost, best_partition)
    return cache[key]


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (len(ar) * 0.25,) * 4
print(partitions(ar, 0, sizes, {}))  # (9.0, (1, 2, 2, 5))
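One possible variation, sketched here under the assumption that the input is already ordered: collapse the array into runs of equal values first, so the dynamic program only ever considers cut points between runs, and return the actual sublists rather than only their sizes. The function name `partition_by_runs` and the use of `functools.lru_cache` for memoization are choices made for this sketch, not part of the answer above.

```python
from functools import lru_cache
from itertools import groupby


def partition_by_runs(a, sizes):
    """Split the ordered list `a` into len(sizes) contiguous parts.

    Cuts fall only between runs of equal values; the cost is the sum of
    squared differences between actual and ideal part sizes.
    Returns (cost, list_of_parts).
    """
    # Length of each run of equal values, e.g. [1, 5, 5] -> [1, 2].
    runs = [len(list(group)) for _, group in groupby(a)]
    # prefix[r] = number of elements contained in the first r runs.
    prefix = [0]
    for length in runs:
        prefix.append(prefix[-1] + length)
    n_runs = len(runs)

    @lru_cache(maxsize=None)
    def best(r, k):
        """Least cost for splitting runs[r:] into the parts sizes[k:]."""
        if k == len(sizes) - 1:
            # The last part takes everything that is left.
            size = prefix[n_runs] - prefix[r]
            return (size - sizes[k]) ** 2, (size,)
        best_cost, best_sizes = None, None
        for s in range(r, n_runs + 1):  # s == r yields an empty part
            size = prefix[s] - prefix[r]
            rest_cost, rest_sizes = best(s, k + 1)
            cost = (size - sizes[k]) ** 2 + rest_cost
            if best_cost is None or cost < best_cost:
                best_cost, best_sizes = cost, (size,) + rest_sizes
        return best_cost, best_sizes

    cost, part_sizes = best(0, 0)
    # Turn the chosen sizes back into slices of the original list.
    parts, start = [], 0
    for size in part_sizes:
        parts.append(a[start:start + size])
        start += size
    return cost, parts


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
cost, parts = partition_by_runs(ar, (2.5, 2.5, 2.5, 2.5))
print(cost)   # 9.0
print(parts)  # [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]
```

Working on runs keeps the number of states at (number of distinct runs) x k, which can be much smaller than (array length) x k when many values repeat.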
