Calculating delta time between records in a DataFrame


Problem description

I have an interesting problem: I am trying to calculate the time delta between records made at different locations.

id x y time
1  x1 y1 10
1  x1 y1 12
1  x2 y2 14
2  x4 y4 8
2  x5 y5 12

I am trying to get something like:

id x y time delta
1 x1 y1 10   4
1 x2 y2 14   0
2 x4 y4 8    4
2 x5 y5 12   0

I have done this type of processing in HiveQL using a custom UDTF, but I was wondering how to achieve it with DataFrames in general (whether in R, pandas, or PySpark). Ideally, I am looking for a solution in Python pandas and PySpark.

Any hint is appreciated. Thank you for your time!

Solution

I think you need drop_duplicates, followed by groupby with DataFrameGroupBy.diff, shift and fillna:

df1 = df.drop_duplicates(subset=['id','x','y']).copy()

df1['delta'] = df1.groupby(['id'])['time'].diff().shift(-1).fillna(0)
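To make the two steps above concrete, here is a minimal sketch that rebuilds the sample table from the question in memory and applies them one at a time. The intermediate variable name `step` is only for illustration and is not part of the original answer.

```python
import pandas as pd

# Sample data from the question, built in memory.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "x":    ["x1", "x1", "x2", "x4", "x5"],
    "y":    ["y1", "y1", "y2", "y4", "y5"],
    "time": [10, 12, 14, 8, 12],
})

# Keep the first record seen at each (id, x, y) location.
df1 = df.drop_duplicates(subset=["id", "x", "y"]).copy()

# Difference to the previous row within each id
# (NaN on the first row of every id).
step = df1.groupby("id")["time"].diff()

# Shift up by one row so each row holds the gap to the next record.
# The NaN that opened the next id (or the end of the frame) becomes 0.
df1["delta"] = step.shift(-1).fillna(0)

print(df1)
```

Note that the `delta` column comes out as floats, because `diff` introduces NaN values before they are filled with 0.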

Final code:

import pandas as pd

df = pd.read_csv("sampleInput.txt",
                 header=None,
                 usecols=[0,1,2,3],
                 names=['id','x','y','time'],
                 sep="\t")

delta = df.groupby(['id','x','y']).first().reset_index() 
delta['delta'] = delta.groupby('id')['time'].diff().shift(-1).fillna(0)
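One thing to watch: the one-liner above applies shift(-1) to the whole column rather than per group, so it relies on all rows of an id being adjacent. The sketch below is a variant, not the answerer's code, that takes the next record's time inside each id instead. The input rows are hypothetical and deliberately interleaved, and the sketch assumes the times are already in ascending order within each id.

```python
import pandas as pd

# Hypothetical input where the rows of each id are NOT adjacent.
df = pd.DataFrame({
    "id":   [1, 2, 1, 2],
    "x":    ["x1", "x4", "x2", "x5"],
    "y":    ["y1", "y4", "y2", "y5"],
    "time": [10, 8, 14, 12],
})

df1 = df.drop_duplicates(subset=["id", "x", "y"]).copy()

# Time of the next record within the same id
# (NaN for the last record of each id).
next_time = df1.groupby("id")["time"].shift(-1)

# Gap to the next record; the last record of each id gets 0.
df1["delta"] = (next_time - df1["time"]).fillna(0)

print(df1)
```

If the input is already sorted by id, as in the question's sample, both forms should give the same result.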

Timings:

In [111]: %timeit df.groupby(['id','x','y']).first().reset_index()
100 loops, best of 3: 2.42 ms per loop

In [112]: %timeit df.drop_duplicates(subset=['id','x','y']).copy()
1000 loops, best of 3: 658 µs per loop
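The timings compare two ways of keeping one row per (id, x, y). As a rough check that they are interchangeable on the question's sample, the following sketch builds both results and compares them. The variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question, built in memory.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "x":    ["x1", "x1", "x2", "x4", "x5"],
    "y":    ["y1", "y1", "y2", "y4", "y5"],
    "time": [10, 12, 14, 8, 12],
})

# Approach 1: group by the location keys and take the first row of each group.
by_groupby = df.groupby(["id", "x", "y"]).first().reset_index()

# Approach 2: drop repeated locations, then renumber the index for comparison.
by_drop = df.drop_duplicates(subset=["id", "x", "y"]).reset_index(drop=True)

# Compare the two results row by row.
same_rows = by_groupby.values.tolist() == by_drop.values.tolist()
print(same_rows)
```

The two are not equivalent in general: groupby sorts the result by the group keys and first() takes the first non-null value in each column, whereas drop_duplicates keeps the original row order and whole rows. On data that is unsorted or contains missing values, the outputs can differ.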
