Calculating delta time between records in a DataFrame
I have an interesting problem: I am trying to calculate the time delta between records made at different locations.
id x y time
1 x1 y1 10
1 x1 y1 12
1 x2 y2 14
2 x4 y4 8
2 x5 y5 12
I am trying to get something like:
id x y time delta
1 x1 y1 10 4
1 x2 y2 14 0
2 x4 y4 8 4
2 x5 y5 12 0
I have done this type of processing in HiveQL using a custom UDTF, but I am wondering how to achieve it with a DataFrame in general (whether in R, pandas, or PySpark). Ideally, I am looking for a solution for Python pandas and PySpark.
Any hint is appreciated. Thank you for your time!
I think you need drop_duplicates with groupby, DataFrameGroupBy.diff, shift and fillna:
df1 = df.drop_duplicates(subset=['id','x','y']).copy()
df1['delta'] = df1.groupby(['id'])['time'].diff().shift(-1).fillna(0)
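As a minimal sketch of the two lines above, the sample table from the question can be built inline and run end to end. The inline DataFrame is an assumption standing in for the real input, which is not shown in full here.

```python
import pandas as pd

# Sample data from the question, built inline (assumption: the real data
# has the same four columns and is already ordered by id and time).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "x": ["x1", "x1", "x2", "x4", "x5"],
    "y": ["y1", "y1", "y2", "y4", "y5"],
    "time": [10, 12, 14, 8, 12],
})

# Keep the first record seen at each (id, x, y) location.
df1 = df.drop_duplicates(subset=["id", "x", "y"]).copy()

# Per-id difference to the previous row, moved up one row so each record
# carries the gap to the next location; the last row of each id gets 0.
df1["delta"] = df1.groupby(["id"])["time"].diff().shift(-1).fillna(0)

print(df1)
```

On this sample the delta column comes out as 4, 0, 4, 0 (as floats), matching the desired output in the question.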
Final code:
import pandas as pd

df = pd.read_csv("sampleInput.txt",
                 header=None,
                 usecols=[0,1,2,3],
                 names=['id','x','y','time'],
                 sep="\t")
delta = df.groupby(['id','x','y']).first().reset_index()
delta['delta'] = delta.groupby('id')['time'].diff().shift(-1).fillna(0)
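One caveat: shift(-1) in the line above is applied to the whole column rather than per group, so the result relies on rows with the same id sitting next to each other. A hedged alternative, assuming rows are already in time order within each id, is to take the next time per group and subtract. The function name add_delta and the interleaved sample below are hypothetical, introduced only for illustration.

```python
import pandas as pd

def add_delta(df):
    """Return one row per (id, x, y) with the time gap to the next row of the same id."""
    out = df.drop_duplicates(subset=["id", "x", "y"]).copy()
    # shift(-1) on the grouped column looks ahead only within the same id.
    next_time = out.groupby("id")["time"].shift(-1)
    # The last row of each id has no successor, so its gap is set to 0.
    out["delta"] = (next_time - out["time"]).fillna(0)
    return out

# Hypothetical input where the two ids are interleaved instead of contiguous.
interleaved = pd.DataFrame({
    "id": [1, 2, 1, 1, 2],
    "x": ["x1", "x4", "x1", "x2", "x5"],
    "y": ["y1", "y4", "y1", "y2", "y5"],
    "time": [10, 8, 12, 14, 12],
})

result = add_delta(interleaved)
print(result)
```

If the input is not guaranteed to be in time order, sorting by id and time before calling the function would be a reasonable extra step.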
Timings:
In [111]: %timeit df.groupby(['id','x','y']).first().reset_index()
100 loops, best of 3: 2.42 ms per loop
In [112]: %timeit df.drop_duplicates(subset=['id','x','y']).copy()
1000 loops, best of 3: 658 µs per loop
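Outside IPython, a comparison like the one above could be reproduced roughly with the standard timeit module. This is only a sketch on the five-row sample from the question (an assumption; the original timings were presumably taken on a different input), so absolute numbers will differ.

```python
import timeit
import pandas as pd

# Five-row sample from the question; a larger frame would give more meaningful timings.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "x": ["x1", "x1", "x2", "x4", "x5"],
    "y": ["y1", "y1", "y2", "y4", "y5"],
    "time": [10, 12, 14, 8, 12],
})
keys = ["id", "x", "y"]

def via_groupby():
    # Sorts by the group keys and rebuilds the index.
    return df.groupby(keys).first().reset_index()

def via_drop_duplicates():
    # Keeps the original row order and index labels.
    return df.drop_duplicates(subset=keys).copy()

# Check that both approaches keep the same rows once ordering is aligned.
a = via_groupby()
b = via_drop_duplicates().sort_values(keys).reset_index(drop=True)
same_rows = a.values.tolist() == b.values.tolist()

t_groupby = timeit.timeit(via_groupby, number=100)
t_drop = timeit.timeit(via_drop_duplicates, number=100)
print(f"same rows: {same_rows}")
print(f"groupby/first:   {t_groupby / 100 * 1e3:.3f} ms per call")
print(f"drop_duplicates: {t_drop / 100 * 1e3:.3f} ms per call")
```

Note that the two approaches differ in row order and index: groupby sorts by the keys, while drop_duplicates preserves the input order.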