pandas: how to find consecutive values in a Series whose differences are within a certain distance


Problem description

I have a pandas Series composed of ints:

import numpy as np
import pandas as pd

a = np.array([1,2,3,5,7,10,13,16,20])
pd.Series(a)

0  1
1  2
2  3
3  5
4  7
5  10
6  13
7  16
8  20

Now I want to cluster the Series into groups such that, within each group, the difference between two neighbouring values is <= distance. For example, if the distance is 1, we have:

[1,2,3], [5], [7], [10], [13], [16], [20]

If the distance is 2, we have:

[1,2,3,5,7], [10], [13], [16], [20]

If the distance is 3, we have:

[1,2,3,5,7,10,13,16], [20]

How can I do this using pandas/numpy?

Recommended answer

Here's one approach, where d is the distance threshold:

np.split(a,np.flatnonzero(np.diff(a)>d)+1)
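To see why the one-liner works, it may help to print the intermediate results. The sketch below uses the sample array from the question with d = 2; the variable names gaps, cut_points and groups are only illustrative and do not appear in the original answer.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2  # distance threshold

# Difference between each value and its predecessor.
gaps = np.diff(a)
# Positions in `a` where a new group starts (gap to the previous value > d).
cut_points = np.flatnonzero(gaps > d) + 1
# np.split cuts the array just before each of those positions.
groups = np.split(a, cut_points)

print(gaps.tolist())                 # [1, 1, 2, 2, 3, 3, 3, 4]
print(cut_points.tolist())           # [5, 6, 7, 8]
print([g.tolist() for g in groups])  # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```

The +1 is needed because np.diff(a)[i] compares a[i+1] with a[i], so a large gap at index i means the new group begins at index i+1.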

As a function that outputs a list of lists:

def splitme(a, d):
    # Split before every position whose gap to the previous value exceeds d.
    return list(map(list, np.split(a, np.flatnonzero(np.diff(a) > d) + 1)))

For performance, I would suggest using zip to get the start and stop indices and then slicing, thus avoiding np.split, which might prove to be the bottleneck:

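def splitme_zip(a, d):
    # Mark the start of each group; the trailing True closes the last group.
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    # Slice a plain Python list instead of calling np.split.
    l = a.tolist()
    return [l[i:j] for i, j in zip(idx[:-1], idx[1:])]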
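As a rough illustration of the start/stop idea, the snippet below prints the boundary indices and the resulting (start, stop) pairs for the sample array with d = 2. The name pairs is introduced here only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 5, 7, 10, 13, 16, 20])
d = 2

# True at index 0, at every group start, and once more at len(a)
# so that the last group also gets a stop index.
m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
idx = np.flatnonzero(m)

# Consecutive boundaries form half-open [start, stop) slices.
pairs = list(zip(idx[:-1].tolist(), idx[1:].tolist()))

print(idx.tolist())  # [0, 5, 6, 7, 8, 9]
print(pairs)         # [(0, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
```

Each pair is then used as l[i:j], so the slices cover the whole input without overlap.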

If you need the output as a list of arrays, skip the list conversion done with .tolist / map(list, ...).
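A minimal sketch of that array-returning variant might look like this. The function name splitme_zip_arrays is hypothetical, and the np.asarray call is an added convenience so that plain lists are accepted too.

```python
import numpy as np

def splitme_zip_arrays(a, d):
    # Hypothetical variant of splitme_zip: slice the array directly,
    # so each group is a NumPy array (a view into `a`), not a list.
    a = np.asarray(a)
    m = np.concatenate(([True], a[1:] > a[:-1] + d, [True]))
    idx = np.flatnonzero(m)
    return [a[i:j] for i, j in zip(idx[:-1], idx[1:])]

out = splitme_zip_arrays(np.array([1, 2, 3, 5, 7, 10, 13, 16, 20]), 3)
print([g.tolist() for g in out])  # [[1, 2, 3, 5, 7, 10, 13, 16], [20]]
```

Because basic slicing returns views, modifying a group in place would also modify the original array; copy the slices if that matters.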

Sample run:

In [122]: a = np.array([1,2,3,5,7,10,13,16,20])

In [123]: splitme(a,1)
Out[123]: [[1, 2, 3], [5], [7], [10], [13], [16], [20]]

In [124]: splitme(a,2)
Out[124]: [[1, 2, 3, 5, 7], [10], [13], [16], [20]]

In [125]: splitme(a,3)
Out[125]: [[1, 2, 3, 5, 7, 10, 13, 16], [20]]

Runtime test:

In [180]: a = np.sort(np.random.randint(1,10000*2,(10000)))

In [181]: s = pd.Series(a)

In [182]: d = 3

In [183]: %timeit pandas_way(s,d) #@cᴏʟᴅsᴘᴇᴇᴅ's soln
10 loops, best of 3: 55.1 ms per loop

In [184]: %timeit np.split(a,np.flatnonzero(np.diff(a)>d)+1)
     ...: %timeit splitme(a,d)
     ...: %timeit splitme_zip(a,d)
1000 loops, best of 3: 1.47 ms per loop
100 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 516 µs per loop

In [185]: a
Out[185]: array([    2,     2,     2, ..., 19992, 19996, 19999])
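
The pandas_way function timed above comes from another answer and is not shown here. For readers who want to stay within pandas, one possible approach is sketched below; the name group_by_gap is hypothetical, and this is not necessarily the code that was timed.

```python
import pandas as pd

def group_by_gap(s, d):
    # Hypothetical helper: a new group starts wherever the gap to the
    # previous value exceeds d; cumsum turns those starts into group ids.
    group_id = s.diff().gt(d).cumsum()
    return [g.tolist() for _, g in s.groupby(group_id)]

s = pd.Series([1, 2, 3, 5, 7, 10, 13, 16, 20])
print(group_by_gap(s, 2))  # [[1, 2, 3, 5, 7], [10], [13], [16], [20]]
```

The first element of s.diff() is NaN, which compares as False against d, so the first value always lands in group 0.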
