Find length of sequences of identical values in a numpy array (run length encoding)


Problem description

In a pylab program (which could probably be a matlab program as well) I have a numpy array of numbers representing distances: d[t] is the distance at time t (and the timespan of my data is len(d) time units).

The events I'm interested in are when the distance is below a certain threshold, and I want to compute the duration of these events. It's easy to get an array of booleans with b = d<threshold, and the problem comes down to computing the lengths of the runs of consecutive True values in b. But I do not know how to do that efficiently (i.e. using numpy primitives), so I resorted to walking the array and doing manual change detection (i.e. initialize a counter when the value goes from False to True, increase the counter as long as the value is True, and append the counter to the output sequence when the value goes back to False). But this is tremendously slow.

How can I efficiently detect that sort of sequence in a numpy array?

Below is some Python code that illustrates my problem: the fourth dot takes a very long time to appear (if not, increase the size of the array).

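from pylab import *

threshold = 7

print('.')
d = 10*rand(10000000)

print('.')

b = d<threshold

print('.')

durations=[]
for i in range(len(b)):
    # a run starts: True at index 0, or True right after a False
    if b[i] and (i==0 or not b[i-1]):
        counter=1
    # the run continues
    if i>0 and b[i-1] and b[i]:
        counter+=1
    # the run ends: True followed by False, or True on the last element
    # (the i>0 guard keeps b[i-1] from wrapping around to b[-1])
    if (i>0 and b[i-1] and not b[i]) or (b[i] and i==len(b)-1):
        durations.append(counter)

print('.')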

Recommended answer

While not numpy primitives, itertools functions are often very fast, so do give this one a try (and, of course, measure the times of the various solutions, including this one):

import itertools

def runs_of_ones(bits):
  # groupby yields one (value, group) pair per run of equal consecutive values
  for bit, group in itertools.groupby(bits):
    if bit: yield sum(group)

If you do need the values in a list, you can of course just use list(runs_of_ones(bits)); but a list comprehension might be marginally faster still:

def runs_of_ones_list(bits):
  return [sum(g) for b, g in itertools.groupby(bits) if b]
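As a quick sanity check, the sketch below applies both itertools variants to a small hand-made array. The sample values in d are made up for illustration and are not from the question; the threshold of 7 is the one used in the question's code.

```python
import itertools
import numpy as np

def runs_of_ones(bits):
    # groupby yields (value, iterator over the consecutive equal items)
    for bit, group in itertools.groupby(bits):
        if bit:
            yield sum(group)  # each True counts as 1

def runs_of_ones_list(bits):
    return [sum(g) for b, g in itertools.groupby(bits) if b]

# Hypothetical sample data (assumption, not from the question)
d = np.array([9.0, 3.0, 2.0, 8.0, 1.0, 1.5, 0.5, 9.5, 4.0])
b = d < 7  # [F, T, T, F, T, T, T, F, T]

# Convert numpy integers to plain ints for readable printing
gen_result = [int(n) for n in runs_of_ones(b)]
list_result = [int(n) for n in runs_of_ones_list(b)]
print(gen_result)   # [2, 3, 1]
print(list_result)  # [2, 3, 1]
```

Both variants return the same run lengths; the generator form is the one to prefer if the values are only consumed once.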

Moving to "numpy-native" possibilities, what about:

import numpy

def runs_of_ones_array(bits):
  # make sure all runs of ones are well-bounded
  # (padding with integer zeros also promotes a boolean array to integers)
  bounded = numpy.hstack(([0], bits, [0]))
  # get 1 at run starts and -1 at run ends
  difs = numpy.diff(bounded)
  run_starts, = numpy.where(difs > 0)
  run_ends, = numpy.where(difs < 0)
  return run_ends - run_starts
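To see why the padding and the difference work, here is a small step-by-step sketch of the same idea on a hand-made boolean array (the sample input is an assumption chosen for illustration). Note that run_starts holds the index in bits where each run begins, and run_ends the index just past its last element.

```python
import numpy as np

# Hypothetical sample input: runs of True with lengths 2, 3 and 1
bits = np.array([False, True, True, False, True, True, True, False, True])

# Pad with zeros so runs touching either edge still have a start and an end
bounded = np.hstack(([0], bits, [0]))
# +1 where a run starts, -1 just past where it ends, 0 elsewhere
difs = np.diff(bounded)
run_starts, = np.where(difs > 0)
run_ends, = np.where(difs < 0)
lengths = run_ends - run_starts

print(bounded)     # [0 0 1 1 0 1 1 1 0 1 0]
print(difs)        # [ 0  1  0 -1  1  0  0 -1  1 -1]
print(run_starts)  # [1 4 8]
print(run_ends)    # [3 7 9]
print(lengths)     # [2 3 1]
```

Because every run is closed off by the padding, run_starts and run_ends always have the same length, so the subtraction pairs each start with its own end.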

Again: be sure to benchmark the solutions against each other on examples that are realistic for you!
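One possible way to run such a comparison is sketched below with the standard timeit module. The array size (100,000 elements), the random seed and the number of repeats are arbitrary assumptions; substitute data that resembles your own. The sketch also checks that both functions agree before timing them.

```python
import itertools
import timeit
import numpy as np

def runs_of_ones_list(bits):
    return [sum(g) for b, g in itertools.groupby(bits) if b]

def runs_of_ones_array(bits):
    bounded = np.hstack(([0], bits, [0]))
    difs = np.diff(bounded)
    run_starts, = np.where(difs > 0)
    run_ends, = np.where(difs < 0)
    return run_ends - run_starts

# Assumed test data: same shape as the question's, but smaller
rng = np.random.default_rng(0)
d = 10 * rng.random(100_000)
b = d < 7

# Check that the two approaches give identical run lengths
same = [int(n) for n in runs_of_ones_list(b)] == runs_of_ones_array(b).tolist()
print("results agree:", same)

# Report the best of three single runs for each function
for fn in (runs_of_ones_list, runs_of_ones_array):
    seconds = min(timeit.repeat(lambda: fn(b), number=1, repeat=3))
    print(fn.__name__, seconds)
```

The timings depend on the machine, the array size and how long the runs are, so treat any single measurement as indicative only.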
