Pandas: run length of NaN holes
Question
I have hundreds of time series objects, each with hundreds of thousands of entries. Some percentage of the data entries are missing (NaN). It matters to my application whether those are single, scattered NaNs or long sequences of NaNs.
Therefore I would like a function that gives me the run length of each contiguous sequence of NaNs. I can do
myseries.isnull()
to get a series of bool. And I can take a moving median or moving average to get an idea of the size of the data holes. However, it would be nice if there were an efficient way of getting a list of hole lengths for a series.
I.e., it would be nice to have a myfunc so that
a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
myfunc(a.isnull())
==> Series([1, 3, 2])
(because there are runs of 1, 3 and 2 NaNs, respectively)
From that, I can make histograms of hole lengths, and of the "and" or "or" of isnull of multiple series (that might be substitutes for each other), and other nice things.
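As a rough illustration of the "and"/"or" idea, the sketch below combines the isnull masks of two made-up series and measures the runs of True in each combined mask. The helper name run_lengths_of_true and the sample data s1 and s2 are assumptions introduced here for illustration only.

```python
import itertools

import numpy as np
import pandas as pd


def run_lengths_of_true(mask):
    """Return the length of each consecutive run of True in a boolean sequence."""
    return [sum(1 for _ in group)
            for is_true, group in itertools.groupby(mask)
            if is_true]


# Two hypothetical series that might substitute for each other.
s1 = pd.Series([1, np.nan, np.nan, 4, np.nan, 6])
s2 = pd.Series([1, 2, np.nan, np.nan, np.nan, 6])

# Positions where both series are missing (no substitute is available).
missing_in_both = s1.isnull() & s2.isnull()
# Positions where at least one series is missing.
missing_in_either = s1.isnull() | s2.isnull()

both_lengths = run_lengths_of_true(missing_in_both)
either_lengths = run_lengths_of_true(missing_in_either)

# Counts per hole length, usable as histogram input.
both_histogram = pd.Series(both_lengths).value_counts().sort_index()

print(both_lengths)
print(either_lengths)
```

The "and" mask shows the holes that remain even when one series can stand in for the other, while the "or" mask shows where the pair is not fully observed.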
I would also like to get ideas of other ways to quantify the "clumpiness" of the data holes.
Recommended answer
import pandas as pd
import numpy as np
import itertools
a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
# Group consecutive values by whether they are NaN and keep the length of each NaN group
len_holes = [len(list(g)) for k, g in itertools.groupby(a, lambda x: np.isnan(x)) if k]
print(len_holes)
Result
[1, 3, 2]
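Since the question mentions series with hundreds of thousands of entries, a vectorized variant may be worth trying. The following is a sketch rather than part of the answer above: it labels each run with a cumulative sum over the points where the isnull mask changes, then counts the NaN positions per run. The function name nan_run_lengths is hypothetical.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(series):
    """Return the length of each contiguous run of NaN in `series` (hypothetical helper)."""
    is_nan = series.isnull()
    # A new run starts wherever the mask differs from the previous element;
    # the cumulative sum of those change points gives every run its own id.
    run_id = (is_nan != is_nan.shift()).cumsum()
    # Keep only the NaN positions and count how many fall into each run.
    lengths = is_nan[is_nan].groupby(run_id[is_nan]).size()
    return lengths.reset_index(drop=True)


a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
hole_lengths = nan_run_lengths(a)
print(hole_lengths.tolist())  # [1, 3, 2]
```

Because the result is a Series, something like hole_lengths.value_counts() gives the number of holes of each length directly, which can feed a histogram.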