Calculating delta time between records in dataframe

Views: 176 Published: 2017/3/26 3:19:18 Tags: python pandas dataframe spark-dataframe pyspark-sql


Problem description

I have an interesting problem: I am trying to calculate the time delta between records made at different locations.

id  x   y   time
1   x1  y1  10
1   x1  y1  12
1   x2  y2  14
2   x4  y4  8
2   x5  y5  12

I am trying to get something like:

id  x   y   time  delta
1   x1  y1  10    4
1   x2  y2  14    0
2   x4  y4  8     4
2   x5  y5  12    0

I have done this type of processing in HiveQL with a custom UDTF, but I was wondering how to achieve it with a DataFrame in general (be it in R, pandas or PySpark). Ideally, I am looking for a solution in Python pandas and PySpark.

Any hint is appreciated. Thank you for your time!

Solution

I think you need drop_duplicates, followed by groupby with DataFrameGroupBy.diff, shift and fillna:

    df1 = df.drop_duplicates(subset=['id','x','y']).copy()
    df1['delta'] = df1.groupby(['id'])['time'].diff().shift(-1).fillna(0)

Final code:

    import pandas as pd

    df = pd.read_csv("sampleInput.txt",
                     header=None,
                     usecols=[0,1,2,3],
                     names=['id','x','y','time'],
                     sep="\t")

    delta = df.groupby(['id','x','y']).first().reset_index()
    delta['delta'] = delta.groupby('id')['time'].diff().shift(-1).fillna(0)

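
One caveat about the diff().shift(-1) idiom: diff() runs per group, but the shift(-1) that follows is applied to the whole resulting column, so it gives the expected deltas only when rows sharing an id sit next to each other, as they do in the sample. Below is a minimal sketch of a variant that shifts within each group instead. It assumes the sample rows from the question, built inline rather than read from sampleInput.txt. The names first_seen and next_time are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question; built inline (an assumption)
# instead of being read from sampleInput.txt.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "x":    ["x1", "x1", "x2", "x4", "x5"],
    "y":    ["y1", "y1", "y2", "y4", "y5"],
    "time": [10, 12, 14, 8, 12],
})

# Keep the first record seen at each (id, x, y) location.
first_seen = df.drop_duplicates(subset=["id", "x", "y"]).copy()

# Time of the next location within the same id; NaN for the last location.
next_time = first_seen.groupby("id")["time"].shift(-1)

# Delta to the next location, 0 where there is no next location.
first_seen["delta"] = (next_time - first_seen["time"]).fillna(0).astype(int)

print(first_seen)
```

Because the shift happens inside each id group, the result does not depend on the groups being contiguous. The order of rows within a group still matters, so sorting by time first may be needed on unsorted data.
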
Timings:

    In [111]: %timeit df.groupby(['id','x','y']).first().reset_index()
    100 loops, best of 3: 2.42 ms per loop

    In [112]: %timeit df.drop_duplicates(subset=['id','x','y']).copy()
    1000 loops, best of 3: 658 µs per loop
