选择不重叠的最佳质量集群 [英] Selecting non-overlapping best quality clusters

查看:78
本文介绍了选择不重叠的最佳质量集群的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

说,我已经对数据集进行了聚类,并且有10个聚类。这些群集是不重叠的。但是现在假设我更改了所有数据点的某些功能,然后再次进行聚类。现在,我还有10个群集。
如果再重复说3次,最后我将拥有50个群集。
每个群集都有与之相关的分数,该分数是根据其成分数据点计算得出的。

Say, I have done clustering on my dataset and have 10 clusters. These clusters are non-overlapping. But now assume I changed some feature in all my data points and do clustering again. Now I have 10 more clusters. If I repeat it say 3 more times, at the end I would have 50 clusters. Each cluster has a score associated with it that is calculated from its constituents data points.

这50个群集现在具有重叠的数据点。我想从这50个群集中选择所有可能的不重叠群集,但总得分最高。

These 50 clusters now have overlapping data points. I want to select all possible non-overlapping clusters out of these 50 clusters but with the highest total score.

一种方法是一种贪婪方法,我根据该方法对群集进行排序分数从最高到最小。然后选择得分最高的集群。然后,从那里继续选择具有非重叠数据点的群集,这些群集已经选择了群集。

One way is a greedy method where I sort the clusters based on the score from highest to smallest. Then select highest scoring cluster. Then from there keep selecting clusters that have non-overlapping data points with already selected clusters. But it doesn't seem to be optimal solution although it is fast.

示例:说我有5个具有以下分数的聚类:

Example: say I have 5 clusters with following scores:

C1 =(A,B,C,D,E,F)得分= 10

C1 = (A,B,C,D,E,F) Score = 10

C2 =(A,B,C)得分= 6

C2 = (A,B,C) Score = 6

C3 =(D,E,F)得分= 6

C3 = (D,E,F) Score = 6

C4 =(G,H, I,J)分数= 5

C4 = (G,H,I,J) Score = 5

C5 =(K,L)分数= 7

C5 = (K,L) Score = 7

贪婪的方法将返回{C1,C4,C5},总得分为10 + 5 + 7 = 22,而更好的选择是{C2,C3,C4,C5},总得分为6 + 6 + 5 + 7 = 24。

The greedy approach will return {C1, C4, C5} with a total score of 10+5+7=22, whereas better option is {C2, C3, C4, C5} with a total score of 6+6+5+7=24.

我正在寻找另一种方法,该方法可以提供比上述贪婪方法更好的解决方案。

I am looking for another method that can give an optimal solution or better solution than above mentioned greedy approach.

推荐答案

您可以使用运筹学技术来解决此问题。

You can solve this using operations research techniques.

像使用set-partitioning问题一样用$ p建模此问题

Model this problem like a set-partitioning problem with

Objective function: maximize score
Constraints: each data point is covered exactly once

然后使用MIP求解器或任何其他技术(例如Hill hiller,遗传算法等)对其进行求解。您的问题的规模很小,因此可以通过任何优化算法解决。我也在研究类似的问题,但在航空机组的调度领域。我的问题的规模是如此之大,以至于约4500个航班的航班时刻表(相当于您的数据点)的可能的机组时刻表(相当于您的集群)>成千上万个组合;)

and then solve it using a MIP solver or any other technique (such as Hill climber, Genetic algorithm etc). The scale of your problem is very small, hence solvable by any optimization algorithm. I am also working on a similar problem but in airline crew scheduling domain. The scale of my problem is so huge that the possible crew schedules (equivalent to your clusters) are >zillion combinations for a flight schedule of ~4500 flights (equivalent to your data points) ;)

我已经用python编写了您的示例,并且使用了Gurobi的MIP求解器,可免费用于学术用途。您也可以使用其他MIP求解器。

I have coded your example in python and I have used a MIP solver from Gurobi, available free of cost for academic use. You can use other MIP solvers too.

以下是python代码:

Here is the python code:

from gurobipy import *
import string

data_points = string.ascii_uppercase[:12]

clusters = []
clusters.append(string.ascii_uppercase[:6])
clusters.append(string.ascii_uppercase[:3])
clusters.append(string.ascii_uppercase[3:6])
clusters.append(string.ascii_uppercase[6:10])
clusters.append(string.ascii_uppercase[10:12])

matrix = {}
for dp in string.ascii_uppercase[:12]:
    matrix[dp] = [0]*5

for i in range(0, len(clusters)):
    for dp in clusters[i]:
        matrix[dp][i] = 1

cost = [10, 6, 6, 5, 7]

# Gurobi MIP model
m = Model("Jitin's cluster optimization problem")
m.params.outputflag = 1
x = m.addVars(len(clusters), vtype=GRB.INTEGER, name='x')
indices = range(0, len(clusters))
coef_x = dict()
obj = 0.0
for i in indices:
    coef_x[i] = cost[i]
    obj += coef_x[i] * x[i]
m.setObjective(obj, GRB.MAXIMIZE)
flight_in_pairings = [[] for i in range(0, 4228)]
for dp,j in zip(data_points, range(0, len(data_points))):
    m.addConstr(sum([x[i]*matrix[dp][i] for i in range(0, len(matrix[dp]))]) == 1, "C"+str(j))
m.optimize()
print('Final Obj:', m.objVal)
m.write('results.sol')

代码的输出:

# Solution for model Jitin's cluster optimization problem
# Objective value = 24
x[0] 0
x[1] 1
x[2] 1
x[3] 1
x[4] 1

这篇关于选择不重叠的最佳质量集群的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆