Python: cluster variables in a list of tuples by 2 factors simultaneously
Problem description
I have the following code:
from math import sqrt

array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13), (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]

def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[0] for pair in lst])/n
##    mean2 = sum([pair[2] for pair in lst])/n
    stdev = sqrt((sum(x[0]*x[0] for x in lst) / n) - (mean * mean))
##    stdev2 = sqrt((sum(x[2]*x[2] for x in lst) / n) - (mean2 * mean2))
    return mean, stdev

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
###### add also the distance between lengths
        mean,stdev = stat(cluster)
        if (abs(mean - i[0]) > n * stdev):    # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list
        cluster.append(i)
    yield cluster           # yield the last cluster

for cluster in parse(array, 7):
    print(cluster)
It clusters my list of tuples (array) by looking at the variable i[0]. What I also want to implement is a further split of each cluster by the variable i[2] of each tuple.
The current output is:
[(1, 'a', 10), (2, 'a', 11), (3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13), (80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]
and I would like to get this:
[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]
So within each cluster the values of i[0] are close together, and so are the values of i[2]. Any ideas how to crack it?
Recommended answer
First of all, your way of computing the variance is numerically unstable. E(X^2) - E(X)^2 holds mathematically, but kills numerical precision. In the worst case you get a negative value, and sqrt then fails (math.sqrt raises ValueError for negative input).
You really should look into numpy, which can compute this properly for you.
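A minimal sketch of the stability point, kept to the standard library so it runs without installing anything (that choice is an assumption of this example, not part of the answer; numpy.std computes the same population standard deviation by default). The function names naive_variance and two_pass_stdev are made up for illustration. The two-pass version subtracts the mean before squaring, so it never subtracts two huge, nearly equal numbers.

```python
from math import fsum, sqrt
from statistics import pstdev


def naive_variance(values):
    """E(X^2) - E(X)^2, the formula used in the question's stat()."""
    n = float(len(values))
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean * mean


def two_pass_stdev(values):
    """Population standard deviation: compute the mean first, then
    average the squared deviations from it."""
    n = len(values)
    mean = fsum(values) / n
    return sqrt(fsum((x - mean) ** 2 for x in values) / n)


# Small spread around a large offset: the true population variance is 2/3.
shifted = [1e9 + 1, 1e9 + 2, 1e9 + 3]

# The naive result can come out as 0, negative, or otherwise far from 2/3,
# because x*x is around 1e18 and doubles cannot resolve the difference.
print(naive_variance(shifted))

# Both of these stay close to sqrt(2/3).
print(two_pass_stdev(shifted))
print(pstdev(shifted))
```

For the small values in the question the naive formula happens to work, but the two-pass form costs almost nothing and removes the risk of passing a negative number to sqrt.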
Conceptually, have you considered treating your data as a 2-dimensional data space, with i[0] and i[2] as the two coordinates? You could then whiten it (rescale each attribute to unit variance) and run e.g. k-means or any other vector-based clustering algorithm.
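The whiten-then-cluster idea could be sketched roughly as follows, in pure Python so it stays self-contained. Several things here are assumptions of the sketch rather than part of the answer: whitening is done per column (dividing by the population standard deviation), k is fixed at 5 because the desired output has five groups, and the starting centroids are picked with a deterministic farthest-point rule instead of at random. The helper names whiten, dist2 and kmeans are illustrative only.

```python
from math import fsum, sqrt

array = [(1, 'a', 10), (2, 'a', 11), (3, 'c', 200), (60, 'a', 12),
         (70, 't', 13), (80, 'g', 300), (100, 'a', 305), (220, 'c', 307),
         (230, 't', 306), (250, 'g', 302)]


def whiten(points):
    """Divide every column by its population standard deviation."""
    n, dims = len(points), len(points[0])
    scales = []
    for d in range(dims):
        col = [p[d] for p in points]
        mean = fsum(col) / n
        std = sqrt(fsum((x - mean) ** 2 for x in col) / n)
        scales.append(std if std > 0 else 1.0)  # leave constant columns alone
    return [tuple(p[d] / scales[d] for d in range(dims)) for p in points]


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    # Deterministic start: the first point, then repeatedly the point
    # that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(fsum(c) / len(members)
                                     for c in zip(*members))
    return labels


# Use i[0] and i[2] as the two coordinates; i[1] is carried along untouched.
features = whiten([(t[0], t[2]) for t in array])
labels = kmeans(features, k=5)

groups = {}
for item, label in zip(array, labels):
    groups.setdefault(label, []).append(item)
for label in sorted(groups):
    print(groups[label])
```

On the question's data the five groups this produces are the five groups listed as the desired output, although they are printed in label order rather than sorted by i[0]. Unlike the sequential parse() in the question, k-means needs the number of clusters up front, so on other data k would have to be chosen or estimated separately.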
Standard deviation and mean generalize easily to multiple attributes (look up "Mahalanobis distance").
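As a rough illustration of that generalization, the sketch below computes the Mahalanobis distance of a 2-dimensional point from a sample of points, using the sample's mean vector and 2x2 population covariance matrix. It plays the role that abs(mean - i[0]) / stdev plays in the one-attribute check. The function name mahalanobis_2d is hypothetical, and restricting it to two attributes (so the matrix inverse can be written out by hand) is a simplification made for this example.

```python
from math import fsum, sqrt


def mahalanobis_2d(point, sample):
    """Mahalanobis distance of `point` from the 2-D points in `sample`."""
    n = len(sample)
    mx = fsum(p[0] for p in sample) / n
    my = fsum(p[1] for p in sample) / n
    # Entries of the population covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = fsum((p[0] - mx) ** 2 for p in sample) / n
    syy = fsum((p[1] - my) ** 2 for p in sample) / n
    sxy = fsum((p[0] - mx) * (p[1] - my) for p in sample) / n
    det = sxx * syy - sxy * sxy
    if det <= 0:
        # Happens for fewer than three points or for collinear points.
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for 2x2.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(max(d2, 0.0))


# Unit variance in both directions and no correlation: the result is just
# the Euclidean distance from the mean (1, 1).
square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(mahalanobis_2d((3, 1), square))  # 2.0

# (i[0], i[2]) pairs of the last cluster in the question, and a candidate
# point whose i[2] is far away from that cluster.
last_cluster = [(220, 307), (230, 306), (250, 302)]
print(mahalanobis_2d((240, 12), last_cluster))
```

A cluster needs at least three non-collinear points before its covariance matrix can be inverted, so a sequential scheme like the question's parse() would need some fallback rule for clusters that are still small.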