Fill timestamp gaps in a large dataset

Views: 93 | Published: 2020/5/24 3:45:56 | Tags: python, pandas, data-science


Question

I have a dataset with 100K+ rows. One of its columns is a datetime column; let's call it A.

My dataset is sorted by column A.

I want to "fill the gaps" in my dataset, i.e. if I have these two consecutive rows:

0  2019-03-13 08:12:20
1  2019-03-13 08:12:25

I want to add the missing seconds between them, so that I end up with this:

0  2019-03-13 08:12:20
1  2019-03-13 08:12:21
2  2019-03-13 08:12:22
3  2019-03-13 08:12:23
4  2019-03-13 08:12:24
5  2019-03-13 08:12:25

I don't want to generate rows between two rows if they differ in day, month, or year.

So if I have these two consecutive rows:

0  2019-03-13 08:12:20
1  2019-03-15 08:12:21

I won't add anything.

I also don't want to generate rows if the time difference between the two rows is greater than 2 hours.

So if I have these two consecutive rows:

0  2019-03-13 08:12:20
1  2019-03-15 11:12:21

I won't add anything.

Here's an example to illustrate what I want:

df=pd.DataFrame({'A': ["2019-03-13 08:12:20", "2019-03-13 08:12:25", "2019-03-20 08:17:23", "2019-03-22 08:17:25", "2019-03-22 11:12:20", "2019-03-22 11:12:23", "2019-03-24 12:33:23"]})
                     A
0  2019-03-13 08:12:20
1  2019-03-13 08:12:25
2  2019-03-20 08:17:23
3  2019-03-22 08:17:25
4  2019-03-22 11:12:20
5  2019-03-22 11:12:23
6  2019-03-24 12:33:23

In the end, I want this result:

                      A
0   2019-03-13 08:12:20
1   2019-03-13 08:12:21
2   2019-03-13 08:12:22
3   2019-03-13 08:12:23
4   2019-03-13 08:12:24
5   2019-03-13 08:12:25
6   2019-03-20 08:17:23
7   2019-03-22 08:17:25
8   2019-03-22 11:12:20
9   2019-03-22 11:12:21
10  2019-03-22 11:12:22
11  2019-03-22 11:12:23
12  2019-03-24 12:33:23

I tried this:

import pandas as pd

df=pd.DataFrame({'A': ["2019-03-13 08:12:20", "2019-03-13 08:12:25", "2019-03-20 08:17:23", "2019-03-22 08:17:25", "2019-03-22 11:12:20", "2019-03-22 11:12:23", "2019-03-24 12:33:23"]})
df['A']=pd.to_datetime(df['A'])
fill = [pd.date_range(df.iloc[i]['A'], df.iloc[i+1]['A'], freq='S') for i in range(len(df)-1) if (df.iloc[i+1]['A']-df.iloc[i]['A']).total_seconds()<=7200]
dates = [item for sublist in fill for item in sublist]
df=df.set_index('A').join(pd.DataFrame(index=pd.Index(dates, name='A')), how='outer').reset_index()
print(df)

It does the job, but it's slow. Is there a faster way to do this?

Recommended answer

You can use diff and cumsum to create a column with a group number, so that consecutive rows at most 2 hours apart share the same group. Then set_index on column A to be able to resample per group, and reset_index to select the column you want.
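To make the grouping step concrete, here is a small sketch of what diff and cumsum produce on the example data from the question. The names `gaps` and `group_id` are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# Example data from the question, parsed as datetimes.
df = pd.DataFrame({"A": pd.to_datetime([
    "2019-03-13 08:12:20", "2019-03-13 08:12:25", "2019-03-20 08:17:23",
    "2019-03-22 08:17:25", "2019-03-22 11:12:20", "2019-03-22 11:12:23",
    "2019-03-24 12:33:23",
])})

# Time elapsed since the previous row (NaT for the first row).
gaps = df["A"].diff()

# True wherever the gap exceeds 2 hours; the running sum of those flags
# gives a label that increases each time a new group starts.
group_id = gaps.gt(pd.Timedelta(hours=2)).cumsum()

print(group_id.tolist())  # [0, 0, 1, 2, 3, 3, 4]
```

Rows 0-1 and rows 4-5 end up sharing a label, and these are exactly the pairs whose gaps get filled.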

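# Column A must already be datetime (e.g. via pd.to_datetime, as in the question).
# Start a new group whenever the gap to the previous row exceeds 2 hours.
df['gr'] = df.A.diff().gt(pd.Timedelta(hours=2)).cumsum()
# Resample each group at 1-second frequency, then keep only column A.
df_output = df.set_index('A').groupby('gr', as_index=False).resample('s').sum().reset_index()[['A']]
print(df_output)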
                     A
0  2019-03-13 08:12:20
1  2019-03-13 08:12:21
2  2019-03-13 08:12:22
3  2019-03-13 08:12:23
4  2019-03-13 08:12:24
5  2019-03-13 08:12:25
6  2019-03-20 08:17:23
7  2019-03-22 08:17:25
8  2019-03-22 11:12:20
9  2019-03-22 11:12:21
10 2019-03-22 11:12:22
11 2019-03-22 11:12:23
12 2019-03-24 12:33:23
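The grouping above only checks the 2-hour rule, so two rows less than 2 hours apart on either side of midnight would still be filled. As a hedged alternative, here is a self-contained sketch that applies both rules from the question and avoids groupby by expanding rows with NumPy. This is not the answer's method. The helper `fill_second_gaps` and its parameters are hypothetical, and it assumes timezone-naive, whole-second timestamps with no missing values.

```python
import numpy as np
import pandas as pd


def fill_second_gaps(df, col="A", max_gap=pd.Timedelta(hours=2)):
    """Hypothetical helper: insert the missing seconds between consecutive rows."""
    s = pd.to_datetime(df[col]).sort_values().reset_index(drop=True)
    nxt = s.shift(-1)          # timestamp of the following row (NaT for the last row)
    gap = nxt - s

    # Fill only when the gap is at most max_gap AND both rows fall on the same
    # calendar day. Comparisons against NaT are False, so the last row is never filled.
    fillable = (gap <= max_gap) & (s.dt.normalize() == nxt.dt.normalize())

    # Each row expands to itself plus the seconds missing before the next row.
    # Rows that are not fillable (and duplicate timestamps) stay as a single row.
    steps = np.where(fillable, gap.dt.total_seconds(), 1)
    steps = np.maximum(steps, 1).astype(np.int64)

    # Repeat each start time, then add 0, 1, 2, ... seconds within each run.
    starts = s.to_numpy().repeat(steps)
    run_start = np.cumsum(steps) - steps
    offsets = np.arange(steps.sum()) - np.repeat(run_start, steps)
    return pd.DataFrame({col: starts + offsets.astype("timedelta64[s]")})


example = pd.DataFrame({"A": [
    "2019-03-13 08:12:20", "2019-03-13 08:12:25", "2019-03-20 08:17:23",
    "2019-03-22 08:17:25", "2019-03-22 11:12:20", "2019-03-22 11:12:23",
    "2019-03-24 12:33:23",
]})
result = fill_second_gaps(example)
print(result)
```

On the example data this yields the same 13 rows as the desired result in the question. Whether it is faster than the resample approach on a real 100K+ row dataset depends on the data and should be measured rather than assumed.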
