Find length of sequences of identical values in a numpy array (run length encoding)

Question

In a pylab program (which could probably be a matlab program as well) I have a numpy array of numbers representing distances: d[t] is the distance at time t, and the timespan of my data is len(d) time units.

The events I'm interested in are when the distance is below a certain threshold, and I want to compute the duration of these events. It's easy to get an array of booleans with b = d<threshold, so the problem comes down to computing the lengths of the consecutive runs of True in b. But I do not know how to do that efficiently (i.e. using numpy primitives), so I resorted to walking the array and doing manual change detection: initialize a counter when the value goes from False to True, increase the counter as long as the value is True, and append the counter to the output sequence when the value goes back to False. But this is tremendously slow.

How can I efficiently detect that sort of sequence in a numpy array?

Below is some Python code that illustrates my problem: the fourth dot takes a very long time to appear (if it does not, increase the size of the array).

from pylab import *

threshold = 7

print('.')
d = 10*rand(10000000)

print('.')

b = d<threshold

print('.')

durations=[]
counter = 0
for i in range(len(b)):
    if b[i]:
        # extend the current run of True values
        counter += 1
    elif counter:
        # a run just ended: record its length and reset
        durations.append(counter)
        counter = 0
if counter:
    # record a run that reaches the end of the array
    durations.append(counter)

print('.')

Recommended answer

While not numpy primitives, itertools functions are often very fast, so do give this one a try (and measure times for the various solutions, including this one, of course):

import itertools

def runs_of_ones(bits):
  # groupby yields one (value, group) pair per run of equal values
  for bit, group in itertools.groupby(bits):
    if bit: yield sum(group)

If you do need the values in a list, you can just use list(runs_of_ones(bits)), of course; but a list comprehension might be marginally faster still:

def runs_of_ones_list(bits):
  return [sum(g) for b, g in itertools.groupby(bits) if b]
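
To make the grouping behaviour concrete, here is a small sketch on a hand-written input. The sample list bits and the variable name runs are illustrative assumptions, not part of the original answer; the point is only that itertools.groupby emits one group per run of consecutive equal values, so summing a group of True values gives the run length.

```python
import itertools

# Hypothetical sample input: two runs of True, of lengths 2 and 3.
bits = [False, True, True, False, True, True, True, False]

# groupby yields (value, group) once per run of consecutive equal values.
for bit, group in itertools.groupby(bits):
    print(bit, len(list(group)))

# Keep only the True runs; sum() counts the True values in each group.
runs = [sum(g) for b, g in itertools.groupby(bits) if b]
print(runs)  # [2, 3]
```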

Moving to "numpy-native" possibilities, what about:

import numpy

def runs_of_ones_array(bits):
  # make sure all runs of ones are well-bounded
  bounded = numpy.hstack(([0], bits, [0]))
  # get 1 at run starts and -1 at run ends
  difs = numpy.diff(bounded)
  run_starts, = numpy.where(difs > 0)
  run_ends, = numpy.where(difs < 0)
  return run_ends - run_starts
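
The same diff-based idea can be extended to a full run length encoding that works for any values, not only booleans. The following is a minimal sketch under that assumption; the function name run_length_encode and its three return values are hypothetical choices made for this illustration and are not from the original answer.

```python
import numpy as np

def run_length_encode(values):
    """Return (run_values, run_starts, run_lengths) for a 1-D array."""
    values = np.asarray(values)
    n = values.size
    if n == 0:
        empty = np.array([], dtype=int)
        return values[:0], empty, empty
    # True wherever an element differs from its predecessor
    change = values[1:] != values[:-1]
    # index of the first element of each run
    run_starts = np.concatenate(([0], np.flatnonzero(change) + 1))
    # gaps between consecutive starts, with n closing the last run
    run_lengths = np.diff(np.concatenate((run_starts, [n])))
    return values[run_starts], run_starts, run_lengths

b = np.array([True, True, False, True, True, True, False, False])
run_values, run_starts, run_lengths = run_length_encode(b)

# For a boolean array, the True-run durations are the lengths of the True runs.
true_durations = run_lengths[run_values]
print(true_durations)  # [2 3]
```

Returning the start indices as well makes it possible to recover when each event began, which the durations alone do not tell you.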

Again: be sure to benchmark the solutions against each other on examples that are realistic for you!
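
One possible way to run such a comparison is sketched below with the standard timeit module. The two functions are reproduced from the answer so that the block is self-contained; the array size, the random seed, and the repeat counts are arbitrary assumptions, and the timings will depend on your data and machine.

```python
import itertools
import timeit

import numpy as np

def runs_of_ones_list(bits):
    return [sum(g) for b, g in itertools.groupby(bits) if b]

def runs_of_ones_array(bits):
    bounded = np.hstack(([0], bits, [0]))
    difs = np.diff(bounded)
    run_starts, = np.where(difs > 0)
    run_ends, = np.where(difs < 0)
    return run_ends - run_starts

# Assumed test data: same shape of problem as the question, but smaller.
rng = np.random.default_rng(0)
b = 10 * rng.random(100_000) < 7

# Sanity check: both approaches should report the same run lengths.
assert runs_of_ones_list(b) == runs_of_ones_array(b).tolist()

for func in (runs_of_ones_list, runs_of_ones_array):
    # best of 3 repeats, each averaging over 3 calls
    seconds = min(timeit.repeat(lambda: func(b), number=3, repeat=3)) / 3
    print(f"{func.__name__}: {seconds:.4f} s per call")
```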
