python numpy:在numpy二维数组中的每一对列上执行功能吗? [英] Python numpy: perform function on each pair of columns in a numpy 2-D array?

查看:675
本文介绍了python numpy:在numpy二维数组中的每一对列上执行功能吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将一个函数应用于numpy数组中的每一对列(每列都是一个人的基因型).

I'm trying to apply a function to each pair of columns in a numpy array (each column is an individual's genotype).

例如:

[48]: g[0:10,0:10]

array([[ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1, -1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1, -1],
      [-1, -1,  0, -1, -1, -1, -1, -1, -1,  0],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1],
      [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1]], dtype=int8)

我的目标是生成距离矩阵d,以使d中的每个元素都是成对的距离,将g中的每一列进行比较.

My goal is to produce a distance matrix d so that each element of d is the pairwise distance comparing each column in g.

d[0,1] = func(g[:,0], g[:,1])

任何想法都太棒了!谢谢!

Any ideas would be fantastic! Thank you!

推荐答案

您可以简单地将函数定义为:

You can simply define the function as:

def count_snp_diffs(x, y): 
    return np.count_nonzero((x != y) & (x >= 0) & (y >= 0),axis=0)

然后使用由itertools.combinations生成的数组作为索引来调用它,以获取所有可能的列组合:

And then call it using as index an array generated with itertools.combinations, in order to get all possible column combinations:

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
dist = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

此外,如果将输出必须存储在矩阵中(对于大的g,由于只有上三角会被填充,而其余所有都是无用的信息,因此不建议使用此矩阵,这是不推荐的信息)可以用相同的技巧实现:

In addition, if the output must be stored in a matrix (which for large g is not recomendes because only the upper triangle will be filled and all the rest will be useless info, this can be achieved with the same trick:

d = np.zeros((g.shape[1],g.shape[1]))
combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
d[combinations[:,0],combinations[:,1]] = count_snp_diffs(g[:,combinations[:,0]], g[:,combinations[:,1]])

现在,d[i,j]返回列ij之间的距离(而d[j,i]为零).这种方法依赖于以下事实:可以使用列表或包含重复索引的数组对数组进行索引:

Now, d[i,j] returns the distance between columns i and j (whereas d[j,i] is a zero). This approach relies in the fact that arrays can be indexed with lists or arrays containing repeated indexes:

a = np.arange(3)+4
a[[0,1,1,1,0,2,1,1]]
# Out
# [4, 5, 5, 5, 4, 6, 5, 5]

这里是一步一步地说明正在发生的事情.

Here is one step by step explanation of what is happening.

调用g[:,combinations[:,0]]会访问排列的第一个集合中的所有列,从而生成一个新的数组,然后将其与g[:,combinations[:,1]]生成的数组逐列进行比较.因此,生成了布尔数组diff.如果g有3列,则看起来像这样,其中每一列是0,10,21,2列的比较:

Calling g[:,combinations[:,0]] accesses all the columns in the first clumn of permutations, generating a new array, which is compared column by column with the array generated with g[:,combinations[:,1]]. Thus, A boolean array diff is generated. If g had 3 columns it would look like this, where each column is the comparison of columns 0,1, 0,2 and 1,2:

[[ True False False]
 [False  True False]
 [ True  True False]
 [False False False]
 [False  True False]
 [False False False]]

最后,将各列的值相加:

And finally, the values for each column are added:

np.count_nonzero(diff,axis=0)
# Out
# [2 3 0]

此外,由于python中的布尔类继承自整数类(大致为False==0和and True==1),请参见此了解更多信息. np.count_nonzero为每个True位置加1,这与通过np.sum获得的结果相同:

In addition, due to the fact that the boolean class in python inherits from integer class (roughly False==0 and and True==1, see this answer of "Is False == 0 and True == 1 in Python an implementation detail or is it guaranteed by the language?" for more info). The np.count_nonzero adds 1 for each True position, which is the same result obtained with np.sum:

np.sum(diff,axis=0) 
# Out
# [2 3 0]

关于性能和内存的注意事项

对于大型阵列,一次处理整个阵列可能需要太多的内存,您可能会得到Memory Error,但是,对于小型或中型阵列,它往往是最快的方法.在某些情况下,按块工作可能会很有用:

Notes on performance and memory

For large arrays, working with the whole array at a time can require too much memory and you can get a Memory Error, however, for small or medium arrays, it tends to be the fastest approach. In some cases it can be useful to work by chunks:

combinations = np.array(list(itertools.combinations(range(g.shape[1]),2)))
n = len(combinations)
dist = np.empty(n)
# B = np.zeros((g.shape[1],g.shape[1]))
chunk = 200
for i in xrange(chunk,n,chunk):
    dist[i-chunk:i] = count_snp_diffs(g[:,combinations[i-chunk:i,0]], g[:,combinations[i-chunk:i,1]])
    # B[combinations[i-chunk:i,0],combinations[i-chunk:i,1]] = count_snp_diffs(g[:,combinations[i-chunk:i,0]], g[:,combinations[i-chunk:i,1]])
dist[i:] = count_snp_diffs(g[:,combinations[i:,0]], g[:,combinations[i:,1]])
# B[combinations[i:,0],combinations[i:,1]] = count_snp_diffs(g[:,combinations[i:,0]], g[:,combinations[i:,1]])

对于g.shape=(300,N)%%timeit在我的计算机上使用python 2.7,numpy 1.14.2和allel 1.1.10报告的执行时间为:

For g.shape=(300,N), the execution times reported by %%timeit in my computer with python 2.7, numpy 1.14.2 and allel 1.1.10 are:

  • 10列
    • numpy +矩阵存储:107 µs
    • numpy + 1D存储:101 µs
    • allel:247 µs
    • 10 columns
      • numpy + matrix storage: 107 µs
      • numpy + 1D storage : 101 µs
      • allel : 247 µs
      • numpy +矩阵存储:15.7毫秒
      • numpy + 1D存储空间:16毫秒
      • allel:22.6毫秒
      • numpy +矩阵存储:1.54秒
      • numpy + 1D存储空间:1.53秒
      • allel:2.28秒

      有了这些数组维,纯numpy比allel模块快了一点,但是应该检查问题中维的计算时间.

      With these array dimensions, pure numpy is a litle faster than allel module, but the computation time should be checked for the dimensions in your problem.

      这篇关于python numpy:在numpy二维数组中的每一对列上执行功能吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆