Pandas: run length of NaN holes


Question

I have hundreds of time series objects, each with hundreds of thousands of entries. Some percentage of the data entries are missing (NaN). It matters to my application whether those are single, scattered NaNs or long sequences of NaNs.

Therefore I would like a function that gives me the run length of each contiguous sequence of NaNs. I can do

myseries.isnull()

to get a series of bools. And I can take a moving median or moving average to get an idea of the size of the data holes. However, it would be nice if there were an efficient way of getting a list of hole lengths for a series.
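The moving-average idea mentioned here could look roughly like the sketch below. The window size of 3 and the names `missing` and `local_missing_fraction` are assumptions chosen for illustration; they are not from the original post.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])

# Boolean mask: True where a value is missing.
missing = a.isnull()

# Hypothetical window size; pick one that matches the hole sizes you care about.
window = 3

# Fraction of missing values in a centered window around each position.
# Values near 1.0 indicate a long hole, small values indicate scattered NaNs.
local_missing_fraction = (
    missing.astype(float).rolling(window, center=True, min_periods=1).mean()
)
print(local_missing_fraction)
```

This gives only a smoothed impression of hole size, which is why an exact list of run lengths is still the more useful result.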

I.e., it would be nice to have a myfunc so that

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
myfunc(a.isnull())
==> Series([1, 3, 2])

(because there are 1, 3 and 2 NaNs, respectively)

From that, I can make histograms of hole lengths, both for a single series and for the element-wise AND or OR of isnull across multiple series (which might be substitutes for each other), and other nice things.
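As a small sketch of the AND/OR combination described here, assuming two equally indexed series `a` and `b` (made-up example data) that can stand in for each other:

```python
import numpy as np
import pandas as pd

# Two hypothetical series that might substitute for each other.
a = pd.Series([1, np.nan, np.nan, 4, np.nan, 6])
b = pd.Series([1, 2, np.nan, np.nan, np.nan, 6])

# True only where both series are missing: holes that a substitute cannot fill.
both_missing = a.isnull() & b.isnull()

# True where at least one series is missing.
either_missing = a.isnull() | b.isnull()

print(both_missing.tolist())
print(either_missing.tolist())
```

Either boolean mask can then be passed to the same run-length function as a single series' isnull() mask.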

I would also like to get ideas of other ways to quantify the "clumpiness" of the data holes.

Recommended answer

import pandas as pd
import numpy as np
import itertools

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
# Group consecutive values by whether they are NaN; keep only the NaN groups and count each one.
len_holes = [len(list(g)) for k, g in itertools.groupby(a, lambda x: np.isnan(x)) if k]
print(len_holes)

Result

[1, 3, 2]
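The itertools approach loops over every element in Python, which may be slow for hundreds of long series. A vectorized alternative, not part of the original answer, is sketched below. It labels each NaN run with the cumulative count of non-NaN values seen so far and then counts the NaNs per label. The function name `nan_run_lengths` is hypothetical.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(series: pd.Series) -> pd.Series:
    """Return the length of each contiguous run of NaNs, in order of appearance."""
    is_nan = series.isna()
    # The running count of non-NaN values stays constant inside a NaN run,
    # so every NaN in the same run shares one id.
    run_id = (~is_nan).cumsum()
    # Keep only the NaN positions and count how many share each run id.
    # A plain array is used as the key to avoid index alignment.
    lengths = is_nan[is_nan].groupby(run_id[is_nan].to_numpy()).size()
    return lengths.reset_index(drop=True)


a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
print(nan_run_lengths(a).tolist())
```

For the example series this yields the same hole lengths as the list comprehension above, and the returned Series can be passed directly to a histogram.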
