Grouping objects to achieve a similar mean property for all groups

Problem description

I have a collection of objects, each of which has a numerical 'weight'. I would like to create groups of these objects such that each group has approximately the same arithmetic mean of object weights.

The groups won't necessarily have the same number of members, but group sizes will differ by at most one. There will be between 50 and 100 objects, and the maximum group size will be about 5.

Is this a well-known type of problem? It seems a bit like a knapsack or partition problem. Are efficient algorithms known to solve it?

As a first step, I wrote a Python script that achieves a very crude equivalence of mean weights by sorting the objects by weight, splitting the sorted objects into subgroups, and then distributing the members of each subgroup among the final groups.
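
The sort-and-distribute idea described above might look roughly like the sketch below. The function name stratified_groups, the n_groups parameter, and the reversal of every other subgroup are assumptions made for illustration; this is not the asker's actual script.

```python
def stratified_groups(weights, n_groups):
    """Sort the weights, cut them into subgroups of n_groups items,
    and deal one item from each subgroup to every final group."""
    ordered = sorted(weights)
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ordered), n_groups):
        subgroup = ordered[start:start + n_groups]
        # Assumption: reverse every other subgroup so that no final group
        # always receives the smallest member of each subgroup.
        if (start // n_groups) % 2 == 1:
            subgroup.reverse()
        # A short final subgroup only reaches the first few groups,
        # so group sizes differ by at most one.
        for group, weight in zip(groups, subgroup):
            group.append(weight)
    return groups


example = stratified_groups(range(1, 21), n_groups=4)
means = [sum(g) / len(g) for g in example]
print(means)
```

Because every final group receives one low, one middle, and one high weight, the group means stay close to the overall mean, although nothing here guarantees the best possible balance.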

I am comfortable programming in Python, so if existing packages or modules cover part of this functionality, I'd appreciate hearing about them.

Thanks for your help and suggestions.

Recommended answer

You could try using k-means clustering:

import collections

import numpy as np
import scipy.cluster.vq as vq


def auto_cluster(data, threshold=0.1, k=1):
    # Increase k until the k-means distortion is no longer above the threshold.
    # There are more sophisticated ways of determining k
    # See http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
    data = np.asarray(data)
    distortion = 1e20
    # Stop once k exceeds the number of observations, so that an
    # unreachable threshold cannot keep the loop running.
    while distortion > threshold and k <= len(data):
        codebook, distortion = vq.kmeans(data, k)
        k += 1
    # Assign each datum to the index of its nearest centroid.
    code, dist = vq.vq(data, codebook)
    groups = collections.defaultdict(list)
    for index, datum in zip(code, data):
        groups[index].append(datum)
    return groups


np.random.seed(784789)
N = 20
weights = 100 * np.random.random(N)
groups = auto_cluster(weights, threshold=1.5, k=N // 5)
# Print the clusters in order of increasing mean weight.
for index, data in enumerate(sorted(groups.values(), key=lambda d: np.mean(d))):
    print('{i}: {d}'.format(i=index, d=data))

The code above generates a random sequence of N weights. It uses scipy.cluster.vq.kmeans to partition the sequence into k clusters of numbers that are close together. If the distortion is above the threshold, kmeans is recomputed with k increased by one. This repeats until the distortion is no longer above the given threshold.
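
To make the threshold easier to choose, it may help to see what the distortion measures. For one-dimensional data, the value kmeans reports is the mean (non-squared) distance from each observation to its nearest centroid, so it is in the same units as the weights. A minimal pure-Python sketch of that quantity follows; the helper name distortion is hypothetical and not part of SciPy.

```python
def distortion(data, centroids):
    """Mean absolute distance from each 1-D observation to its nearest centroid."""
    nearest = [min(abs(x - c) for c in centroids) for x in data]
    return sum(nearest) / len(nearest)


# Two tight clusters around 2 and 11: every point is 1 away from its centroid.
print(distortion([1, 3, 10, 12], centroids=[2, 11]))
```

Read this way, a threshold of 1.5 asks for clusters whose members sit, on average, within 1.5 weight units of their cluster center.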

It yields clusters such as this:

0: [4.9062151907551366]
1: [13.545565038022112, 12.283828883935065]
2: [17.395300245930066]
3: [28.982058040201832, 30.032607500871023, 31.484125759701588]
4: [35.449637591061979]
5: [43.239840915978043, 48.079844689518424, 40.216494950261506]
6: [52.123246083619755, 53.895726546070463]
7: [80.556052179748079, 80.925071671718413, 75.211470587171803]
8: [86.443868931310249, 82.474064251040375, 84.088655128258964]
9: [93.525705849369416]

Note that the k-means clustering algorithm uses random guesses to pick the initial centers of the k groups. This means that repeated runs of the same code can produce different results, particularly if the weights do not separate themselves into clearly distinct groups.
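
If reproducible clusters matter, the random initialization can be pinned. The sketch below assumes SciPy 1.7 or later, where kmeans accepts a seed keyword; the seed values and the choice of k=4 are arbitrary and only for illustration.

```python
import numpy as np
import scipy.cluster.vq as vq

# Arbitrary example weights, generated from a fixed seed.
weights = 100 * np.random.default_rng(0).random(20)

# Assumption: SciPy >= 1.7, where kmeans accepts a `seed` keyword.
book_a, dist_a = vq.kmeans(weights, 4, seed=1234)
book_b, dist_b = vq.kmeans(weights, 4, seed=1234)

# Both runs start from the same initial guesses, so they agree.
print(np.allclose(book_a, book_b), dist_a == dist_b)
```

Fixing the seed makes runs repeatable, but it does not make the clustering any better; a different seed may still give a different partition.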

You'll also have to adjust the threshold parameter to produce the desired number of groups.
