Python: cluster variables in a list of tuples by 2 factors simultaneously

Question

I have the following code:

from math import sqrt
array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13), (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]


def stat(lst):
    """Calculate mean and std deviation from the input list."""
    n = float(len(lst))
    mean = sum([pair[0] for pair in lst])/n
##    mean2 = sum([pair[2] for pair in lst])/n
    stdev = sqrt((sum(x[0]*x[0] for x in lst) / n) - (mean * mean))
##    stdev2 = sqrt((sum(x[2]*x[2] for x in lst) / n) - (mean2 * mean2)) 

    return mean, stdev

def parse(lst, n):
    cluster = []
    for i in lst:
        if len(cluster) <= 1:    # the first two values are going directly in
            cluster.append(i)
            continue
###### add also the distance between lengths
        mean,stdev = stat(cluster)
        if (abs(mean - i[0]) > n * stdev):   # check the "distance"
            yield cluster
            cluster[:] = []    # reset cluster to the empty list

        cluster.append(i)
    yield cluster           # yield the last cluster

for cluster in parse(array, 7):
    print(cluster)

What it does is cluster my list of tuples (array) by looking at the variable i[0]. What I also want to implement is a further clustering by the variable i[2] in each tuple.

The current output is:

[(1, 'a', 10), (2, 'a', 11), (3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13), (80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]

I would like to get this instead:

[(1, 'a', 10), (2, 'a', 11)]
[(3, 'c', 200)]
[(60, 'a', 12), (70, 't', 13)]
[(80, 'g', 300), (100, 'a', 305)]
[(220, 'c', 307), (230, 't', 306), (250, 'g', 302)]

That is, within each cluster the values of i[0] are close together, and so are the values of i[2]. Any ideas how to crack it?

Answer

First of all, your way of computing variance is numerically unstable. E(X^2) - E(X)^2 holds mathematically, but kills numerical precision. In the worst case you get a negative value, and sqrt then fails.

You really should look into numpy, which can compute this properly for you.
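
To illustrate the precision problem, here is a small sketch comparing the E(X^2) - E(X)^2 shortcut with a stable computation. The answer suggests numpy (where `numpy.std` would do the job); to stay dependency-free, this sketch assumes the standard library's `statistics.pstdev` as a stand-in, which also computes the population standard deviation. The function names and the sample data are made up for illustration.

```python
from math import sqrt
from statistics import pstdev

def naive_variance(values):
    """Population variance via E(X^2) - E(X)^2 (the unstable shortcut)."""
    n = float(len(values))
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean * mean

def stable_stdev(values):
    """Population standard deviation computed by the standard library."""
    return pstdev(values)

# Hypothetical data: a tiny spread on top of a huge offset.
data = [1e9 + 1, 1e9 + 2, 1e9 + 3]

# The true population variance is 2/3. The shortcut loses it to rounding,
# because the squares are around 1e18, where floats cannot resolve +-1.
print("naive variance:", naive_variance(data))
print("stable stdev:  ", stable_stdev(data))
print("expected stdev:", sqrt(2 / 3))
```

For small values such as those in the question the shortcut happens to work, which is why the problem is easy to miss until the data grows.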

Conceptually, have you considered treating your data as a 2-dimensional data space? You could then whiten it, and run e.g. k-means or any other vector-based clustering algorithm.
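
One way to read this suggestion, as a rough sketch rather than the answer's own code: treat each tuple as the 2-D point (i[0], i[2]), rescale each attribute by its standard deviation (a simple form of whitening that ignores correlation between the attributes), and then group points in the rescaled space. Instead of k-means, the sketch below swaps in a much simpler rule that keeps the list order: start a new cluster whenever the rescaled distance to the previous item exceeds a threshold. The function name `cluster_2d` and the threshold 0.5 are assumptions chosen for this sample data.

```python
from math import hypot
from statistics import pstdev

array = [(1,'a',10), (2,'a',11), (3,'c',200), (60,'a',12), (70,'t',13),
         (80,'g',300), (100,'a',305), (220,'c',307), (230,'t',306), (250,'g',302)]

def cluster_2d(items, max_gap):
    """Split items into runs, breaking where consecutive items are far apart.

    Distance is measured on (item[0], item[2]) after dividing each
    attribute by its population standard deviation over the whole list.
    """
    if not items:
        return []
    # Scale factors; fall back to 1.0 if an attribute is constant.
    sx = pstdev([it[0] for it in items]) or 1.0
    sy = pstdev([it[2] for it in items]) or 1.0

    clusters = [[items[0]]]
    for prev, cur in zip(items, items[1:]):
        gap = hypot((cur[0] - prev[0]) / sx, (cur[2] - prev[2]) / sy)
        if gap > max_gap:
            clusters.append([])  # too far from the previous item: new cluster
        clusters[-1].append(cur)
    return clusters

for cluster in cluster_2d(array, 0.5):
    print(cluster)
```

With this data and threshold the sketch reproduces the five groups the question asks for. The threshold is data-dependent, though, and comparing only neighbouring items is a deliberate simplification; a general-purpose algorithm such as k-means would not rely on the list order.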

Standard deviation and mean are trivial to generalize to multiple attributes (look up "Mahalanobis distance").
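
As a minimal sketch of that generalization (not the answer's code): for a cluster of 2-D points, the mean becomes a mean vector and the standard deviation becomes a 2x2 covariance matrix. The Mahalanobis distance of a new point from the cluster then plays the role that `abs(mean - i[0])` measured in units of `stdev` plays in the question's `parse`. The helper name `mahalanobis_2d` is hypothetical, and using the population covariance is an assumption that mirrors the question's division by n.

```python
from math import sqrt

def mahalanobis_2d(point, cluster):
    """Mahalanobis distance of a 2-D point from a cluster of 2-D points."""
    n = float(len(cluster))
    mx = sum(p[0] for p in cluster) / n
    my = sum(p[1] for p in cluster) / n
    # Population covariance matrix [[sxx, sxy], [sxy, syy]], two-pass form.
    sxx = sum((p[0] - mx) ** 2 for p in cluster) / n
    syy = sum((p[1] - my) ** 2 for p in cluster) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in cluster) / n
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular; points lack spread")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for the 2x2 case.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(max(d2, 0.0))  # guard against tiny negative rounding errors

# Hypothetical cluster: unit variance on both axes, no correlation.
square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(mahalanobis_2d((3, 1), square))  # the point lies 2 units from the mean along x
```

In the question's setting the points would be `(i[0], i[2])`. Note that a cluster of only two points always has a singular covariance matrix, so a real implementation would need a fallback for the small clusters that `parse` starts with.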
