How to Efficiently Process Time-Series Data in Pandas

Question

I have data sets representing travel times past given nodes. The data is in one CSV file per node, in this format: node name, datetime, irrelevant field, mac address

I'm reading them into one DataFrame in Pandas:

dfs = [pd.read_csv(f, names=CSV_COLUMNS, parse_dates=[1]) for f in files]
return pd.concat(dfs)

What I want to do is find the time difference between a MAC address's appearance at one node and its appearance at the next. Right now I'm looping over the resulting DataFrame, which isn't efficient and isn't working: every way I've tried to sort the data causes a problem.

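  • I can't sort by MAC and datetime, because I need to preserve the direction of travel (sorting by datetime makes every trip look as if it moves in the positive direction).
  • Sorting by MAC alone keeps the rows in node order (because they are pushed into the file in node order).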

While I may be able to figure out the sorting problem, the larger issue is that I'm new to Pandas, and I bet there's a right way to do this in Pandas. What I want at the end of processing is a data set that shows the travel time (timediff.total_seconds() or similar) for every pair of nodes that a MAC traveled directly between. That last bit is important: for a layout where the nodes are A, B and C, most travel will be A-B or B-C (or the reverse), but some MACs may not register at B and will go from A to C. Some appearances may also be orphans, where a MAC appears at one node but never shows up at another.

Recommended answer

If the DataFrame is sorted by datetime within each mac address, you can probably do:

grb = df.groupby('mac address')
# previous node and previous timestamp for the same MAC (NaN/NaT on its first row)
df['origin'] = grb['node name'].shift(1)
df['departure time'] = grb['datetime'].shift(1)

The travel time is then (arrival minus departure, so that it comes out positive):

df['travel time'] = df['datetime'] - df['departure time']

And if the node names are strings, the path would be:

df['path'] = df['origin'] + '-' + df['node name']
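
To make the steps above concrete, here is a minimal sketch on a made-up three-node data set. The sample rows and the `travel seconds` column are illustrative assumptions, not part of the original question; the other column names follow the ones used above. The travel time is taken as arrival minus departure so that it is positive, and rows with no origin are dropped, which removes each MAC's first sighting and, with it, any orphan that was seen only once.

```python
import pandas as pd

# Hypothetical sample data: three MACs observed at nodes A, B and C.
df = pd.DataFrame(
    {
        "node name": ["A", "B", "C", "C", "A", "B"],
        "datetime": pd.to_datetime(
            [
                "2024-01-01 08:00:00",
                "2024-01-01 08:05:00",
                "2024-01-01 08:12:00",
                "2024-01-01 09:00:00",
                "2024-01-01 09:20:00",
                "2024-01-01 10:00:00",
            ]
        ),
        "mac address": ["m1", "m1", "m1", "m2", "m2", "m3"],
    }
)

# Sort by time within each MAC so that shift(1) yields the previous sighting.
df = df.sort_values(["mac address", "datetime"]).reset_index(drop=True)

grb = df.groupby("mac address")
df["origin"] = grb["node name"].shift(1)
df["departure time"] = grb["datetime"].shift(1)

# Arrival minus departure, expressed in seconds.
df["travel seconds"] = (df["datetime"] - df["departure time"]).dt.total_seconds()
df["path"] = df["origin"] + "-" + df["node name"]

# The first sighting of each MAC has no origin; dropping it also removes orphans.
trips = df.dropna(subset=["origin"]).reset_index(drop=True)
print(trips[["mac address", "path", "travel seconds"]])
```

In this sample, `m2` goes directly from C to A without registering at B, and `m3` is an orphan that produces no trip.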

Edit: this may be faster, assuming travel times cannot be negative:

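import numpy as np  # needed for np.nan below

# DataFrame.sort was removed from pandas; sort_values is its replacement
df.sort_values(['mac address', 'datetime'], inplace=True)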

df['origin'] = df['node name'].shift(1)
df['departure time'] = df['datetime'].shift(1)

# correct for the places where the mac addresses change
idx = df['mac address'] != df['mac address'].shift(1)
df.loc[idx, 'origin'] = np.nan
df.loc[idx, 'departure time'] = np.nan
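
The sort-and-shift variant above can also be wrapped in a small function. This is only a sketch under assumptions: the function name `add_previous_sighting` and the sample rows are hypothetical, and `Series.where` is used in place of the `.loc` assignment to blank out rows where the MAC address changes, which should have the same effect.

```python
import pandas as pd


def add_previous_sighting(df):
    """Return a sorted copy of df with 'origin' and 'departure time' columns.

    Hypothetical helper: sorts by MAC and time, shifts whole columns by one
    row, then blanks the rows whose MAC differs from the row above.
    """
    out = df.sort_values(["mac address", "datetime"]).reset_index(drop=True)
    # True where the previous row belongs to the same MAC address.
    same_mac = out["mac address"] == out["mac address"].shift(1)
    out["origin"] = out["node name"].shift(1).where(same_mac)
    out["departure time"] = out["datetime"].shift(1).where(same_mac)
    return out


# Made-up input, deliberately unsorted.
raw = pd.DataFrame(
    {
        "node name": ["B", "A", "C", "A"],
        "datetime": pd.to_datetime(
            [
                "2024-01-01 08:05",
                "2024-01-01 08:00",
                "2024-01-01 09:00",
                "2024-01-01 09:20",
            ]
        ),
        "mac address": ["m1", "m1", "m2", "m2"],
    }
)

result = add_previous_sighting(raw)
print(result)
```

Returning a sorted copy instead of sorting in place leaves the caller's DataFrame untouched, which makes the helper easier to reuse.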
