在多索引数据框中填充缺少的时间值 [英] Filling missing time values in a multi-indexed dataframe

查看:75
本文介绍了在多索引数据框中填充缺少的时间值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据文件,其中包含从多个传感器异步读取的时间序列.基本上,对于文件中的每个数据元素,我都有一个传感器ID和读取它的时间,但是我并不总是每次都具有所有传感器,并且读取时间可能不会均匀分布.像这样:

I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:

ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5  # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9  # skip some time steps
2,5,10

重要提示,实际的time列为日期时间类型.

Important note the actual time column is of datetime type.

我想要的是能够在不存在该传感器的任何时间步长内对每个传感器进行零阶保持(正向填充)值,然后将其设置为零或向后填充在该传感器上未读取的任何传感器.最早的时间步骤.我想要的是一个看起来像从中读取的数据框:

What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:

ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2  # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2  # ID 1 still holding
2,2,7
0,3,6  # ID 0 holding
1,3,2  # ID 1 still holding
2,3,8
0,5,6  # ID 0 still holding, can skip totally missing time steps
1,5,9  # ID 1 finally updates
2,5,10

到目前为止,熊猫尝试过

我初始化数据框并设置索引:

Pandas attempts so far

I initialize my dataframe and set my indices:

df = pd.read_csv(filename, dtype=np.int)
df.set_index(['ID', 'time'], inplace=True)

我试图弄乱诸如此类的东西

I try to mess with things like:

filled = df.reindex(method='ffill')

等具有传递给df.index['time']等的index关键字参数的各种值.由于我传递了无效的关键字参数,这总是会引发错误,或者对数据框不可见.我认为无法识别我正在寻找的数据丢失".

or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".

我也尝试过:

df.update(df.groupby(level=0).ffill())

level=1基于 Pandas中的多索引fillna ,但我又没有对数据框进行任何可见的更改,我认为是因为我目前没有什么可以让我的价值观保持不变的东西.

or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.

我对numpy和非整数索引使用了一些运气,例如:

I have had some luck with numpy and non-integer indexing using something like:

data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]

但这有两个问题:

  1. 这将我带出了熊猫世界,现在我必须手动维护我的传感器ID,时间索引等以及特征向量(实际的data列不仅是一列,而且是大量的值)来自传感器套件).
  2. 考虑到列数和实际数据集的大小,在我的实际示例中实现起来将很笨拙且难以实现.我希望在大熊猫中做到这一点.
  1. It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
  2. Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.

应用程序

最终,这只是用于训练递归神经网络的数据清理步骤,其中,对于每个时间步,我都需要提供始终具有相同结构的特征向量(每个时间步,每个传感器ID都有一组测量值).

The application

Ultimately this is just the data-cleaning step for training recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).

谢谢您的帮助!

推荐答案

我可以将其与以下内容一起使用,对于这样的任何情况(在该情况下,您要为其填充值的时间索引为在具有两个索引的多索引中排名第二:

I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:

# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]

# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill())  # first ZOH forward
df.update(df.bfill())  # now back fill values that are not seen at the beginning

# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()

这让我得到了想要的东西,尽管如果有人知道这样做的好方法,我希望能够保留重复的时间条目.

This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.

如果对于特定的应用程序,从零开始看不见的值,则还可以使用df.update(df.fillna(0))而不是回填.

You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.

我将上面的代码块放在一个名为clean_df的函数中,该函数将数据帧作为参数并返回清除的数据帧.

I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.

这篇关于在多索引数据框中填充缺少的时间值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆