Increase performance of np.where() loop


Problem description


I am trying to speed up the code for the following script (ideally >4x) without multiprocessing. In a future step, I will implement multiprocessing, but the current speed is too slow even if I split it up across 40 cores. Therefore I'm trying to optimize the code first.

import numpy as np

def loop(x, y, q, z):
    # collect the match indices for every row of x/y
    matchlist = []
    for ind in range(len(x)):
        matchlist.append(find_match(x[ind], y[ind], q, z))
    return matchlist

def find_match(x, y, q, z):
    A = np.where(q == x)  # indices in q that equal x
    B = np.where(z == y)  # indices in z that equal y
    return np.intersect1d(A, B)


# N will finally scale up to 10^9
N = 1000
M = 300

X = np.random.randint(M, size=N)
Y = np.random.randint(M, size=N)

# Q and Z size is fixed at 120000
Q = np.random.randint(M, size=120000)
Z = np.random.randint(M, size=120000)

# convert int32 arrays to str64 arrays, to represent original data (which are strings and not numbers)
X = np.char.mod('%d', X)
Y = np.char.mod('%d', Y)
Q = np.char.mod('%d', Q)
Z = np.char.mod('%d', Z)

matchlist = loop(X,Y,Q,Z)

Some details:

I have two arrays (X and Y) which are identical in length. Each row of these arrays corresponds to one DNA sequencing read (basically strings of the letters 'A','C','G','T'; details not relevant for the example code here).

I also have two 'reference arrays' (Q and Z), which are identical in length. I want to find the occurrences (with np.where()) of every element of X within Q, as well as of every element of Y within Z (basically the find_match() function). Afterwards I want to know whether there is an overlap/intersect between the indexes found for X and Y.

Example output (matchlist; some rows of X/Y have matching indexes in Q/Z, some don't, e.g. index 11):

The code works fine so far, but it would take quite long to execute with my final dataset, where N=10^9 (in this code example N=1000 to make the tests quicker). 1000 rows of X/Y take about 2.29 s to execute on my laptop:

Every find_match() call takes about 2.48 ms to execute, which is roughly 1/1000 of the final loop.

A first approach would be to combine (x with y) as well as (q with z), so that I only need to run np.where() once, but I couldn't get that to work yet.
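
The "combine (x with y)" idea could be sketched like this (a hypothetical approach, not code from the question): glue each pair into one string with a separator assumed not to occur in the data, so a single equality test replaces the two np.where() calls plus the intersection.

```python
import numpy as np

# Toy data: Q/Z are the reference pairs, X/Y the query pairs.
Q = np.array(['12', '7', '12'])
Z = np.array(['5', '9', '5'])
X = np.array(['12'])
Y = np.array(['5'])

# Join each (q, z) and (x, y) pair into one string; '|' is assumed
# to never appear in the original data.
QZ = np.char.add(np.char.add(Q, '|'), Z)   # ['12|5', '7|9', '12|5']
XY = np.char.add(np.char.add(X, '|'), Y)   # ['12|5']

# One np.where() per row instead of two plus np.intersect1d().
matchlist = [np.where(QZ == xy)[0] for xy in XY]
```

This still scans Q/Z once per row of X/Y, so it roughly halves the work rather than changing the complexity; the dict lookup in the accepted answer below removes the per-row scan entirely.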

What I've tried so far:

  • loop and lookup within Pandas (.loc()), but this was about 4x slower than the np.where() method

The question is closely related to a recent question from philshem (Combine Numpy "where" statements); however, the solutions provided for that question do not work for my approach here.

Solution

Numpy isn't too helpful here, since what you need is a lookup into a jagged array, with strings as the indexes.

lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
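
A quick sanity check on small, made-up data (the arrays here are illustrative, not from the question): the dict lookup reproduces the original find_match() result, i.e. np.intersect1d() of the two np.where() index sets, for every row.

```python
import numpy as np

# Illustrative reference and query arrays.
Q = np.array(['3', '8', '3', '5'])
Z = np.array(['1', '5', '1', '2'])
X = np.array(['3', '5', '9'])
Y = np.array(['1', '2', '9'])

# Build the (q, z) -> [indices] dict once, in O(len(Q)).
lookup = {}
for i, (q, z) in enumerate(zip(Q, Z)):
    lookup.setdefault((q, z), []).append(i)

matchlist = [lookup.get((x, y), []) for x, y in zip(X, Y)]
# matchlist == [[0, 2], [3], []]

# Each row matches the original find_match() computation.
for (x, y), got in zip(zip(X, Y), matchlist):
    expected = np.intersect1d(np.where(Q == x), np.where(Z == y))
    assert got == expected.tolist()
```

The key point is that the dict is built once in a single pass over Q/Z, after which each row of X/Y is a constant-time lookup instead of two full scans.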

If you don't need the output as a jagged array, but are OK with just a boolean denoting presence, and can preprocess each string to a number, there is a much faster method.

lookup = np.zeros((300, 300), dtype=bool)
lookup[Q, Z] = True
matchlist = lookup[X, Y]
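
The boolean table above assumes the values are already small integers. One way to do the "preprocess each string to a number" step the answer mentions (this factorization is my assumption, not part of the answer) is np.unique(..., return_inverse=True) over the combined arrays, so X/Q share one code space and Y/Z share another:

```python
import numpy as np

# Illustrative string data.
X = np.array(['3', '7', '3']); Y = np.array(['1', '2', '9'])
Q = np.array(['3', '8']);      Z = np.array(['1', '5'])

# Factorize strings to dense integer codes; concatenating first
# guarantees X/Q (and Y/Z) use the same code for the same string.
xq, xq_codes = np.unique(np.concatenate([X, Q]), return_inverse=True)
yz, yz_codes = np.unique(np.concatenate([Y, Z]), return_inverse=True)
Xc, Qc = xq_codes[:len(X)], xq_codes[len(X):]
Yc, Zc = yz_codes[:len(Y)], yz_codes[len(Y):]

# Same boolean-table trick as above, sized to the number of distinct values.
lookup = np.zeros((len(xq), len(yz)), dtype=bool)
lookup[Qc, Zc] = True
matchlist = lookup[Xc, Yc]
```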

You typically won't want to use this method as a replacement for the former jagged case, since dense variants (e.g. Daniel F's solution) are memory-inefficient and numpy does not support sparse arrays well. However, if more speed is needed, a sparse solution is certainly possible.
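
One possible sparse-friendly variant (my sketch; the answer does not spell one out): flatten each (q, z) pair into a single integer code and test membership with np.isin(), so no dense M x M table is ever allocated.

```python
import numpy as np

# Illustrative integer data (after string-to-number preprocessing).
M = 300
Q = np.array([3, 8]);    Z = np.array([1, 5])
X = np.array([3, 7, 3]); Y = np.array([1, 2, 9])

# Encode each pair as one integer; int64 avoids overflow for larger M.
qz = Q.astype(np.int64) * M + Z    # code per reference pair
xy = X.astype(np.int64) * M + Y    # code per query pair

# Boolean presence per row, without building an M x M lookup table.
matchlist = np.isin(xy, qz)
```

np.isin() sorts the reference codes internally, so this costs roughly O((len(Q) + len(X)) log len(Q)) while touching only the pairs that actually occur.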
