How to bin time in a pandas dataframe


Problem description

I am trying to analyze the average daily fluctuations in a measurement "X" over several weeks using pandas dataframes, but timestamps, datetimes and the like are proving particularly hellish to deal with. Having spent a good few hours trying to work this out, my code is getting messier and messier and I don't think I'm any closer to a solution. I'm hoping someone here can point me in the right direction.

I have measured X at different times and on different days, saving the daily results to a dataframe which has the form:

    Timestamp(datetime64)         X 

0    2015-10-05 00:01:38          1
1    2015-10-05 06:03:39          4 
2    2015-10-05 13:42:39          3
3    2015-10-05 22:15:39          2
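For anyone who wants to reproduce the problem, a frame of this shape can be built as in the sketch below. The rows are copied from the table above; the column name ts is an assumption made for illustration (the question does not give the real name).

```python
import pandas as pd

# Rows copied from the table above. The column name "ts" is an
# assumption for illustration; "X" is the measurement.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2015-10-05 00:01:38",
        "2015-10-05 06:03:39",
        "2015-10-05 13:42:39",
        "2015-10-05 22:15:39",
    ]),
    "X": [1, 4, 3, 2],
})

# The timestamp column should come out with a datetime64 dtype.
print(df.dtypes)
```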

As the time at which the measurement is made changes from day to day, I decided to use binning to organize the data and then work out the average and standard deviation for each bin, which I can then plot. My idea was to create a final dataframe with the bins and the average value of X for the measurements in each bin; the 'Observations' column is just to aid understanding:

        Time Bin       Observations     <X>  

0     00:00-05:59      [ 1 , ...]       2.3
1     06:00-11:59      [ 4 , ...]       4.6
2     12:00-17:59      [ 3 , ...]       8.5
3     18:00-23:59      [ 2 , ...]       3.1

However, I've run into difficulties with incompatibilities between time, datetime, datetime64 and timedelta, and with binning using pd.cut and pd.groupby. Basically, I feel like I'm making stabs in the dark with no idea as to the 'right' way to approach this problem. The only solution I can think of is a row-by-row iteration through the dataframe, but I'd really like to avoid having to do this.

Solution

Whenever I bin time series data by a time range, which seems to be what you are doing here, I just create an "hour of day" column and slice over that. Also, I normally set the index as datetime values...though that is not necessary here.

# assuming your "timestamp" column is labeled ts and has dtype datetime64:
df['hod'] = df['ts'].dt.hour

# now you can calculate stats for each bin, e.g. the mean of X for 00:00-05:59
ave = df.loc[(df['hod'] >= 0) & (df['hod'] < 6), 'X'].mean()
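To get the statistics for all four bins in one pass rather than slicing each range by hand, one option is to label every row with pd.cut and then group on that label. The following is only a sketch and not part of the original answer: the column names ts, hod and time_bin are assumptions, and the two readings on 2015-10-06 are made up so that some bins hold more than one observation.

```python
import pandas as pd

# First four rows are from the question; the last two are made-up
# readings on a second day, added only for illustration.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2015-10-05 00:01:38",
        "2015-10-05 06:03:39",
        "2015-10-05 13:42:39",
        "2015-10-05 22:15:39",
        "2015-10-06 01:10:00",
        "2015-10-06 07:30:00",
    ]),
    "X": [1, 4, 3, 2, 3, 6],
})

# Hour of day (0-23) for each measurement.
df["hod"] = df["ts"].dt.hour

# Left-closed six-hour bins: [0, 6), [6, 12), [12, 18), [18, 24).
edges = [0, 6, 12, 18, 24]
labels = ["00:00-05:59", "06:00-11:59", "12:00-17:59", "18:00-23:59"]
df["time_bin"] = pd.cut(df["hod"], bins=edges, right=False, labels=labels)

# Mean, standard deviation and number of observations per bin,
# pooled across all days.
stats = df.groupby("time_bin", observed=False)["X"].agg(["mean", "std", "count"])
print(stats)
```

A bin that holds a single observation gets NaN as its standard deviation, since the sample standard deviation needs at least two values.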

I would think there is a way to use df.resample here, but with the poorly defined starting/ending points in your time series I think it may require more attention than the method above.
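For comparison, here is a rough sketch of what a resample-based approach might look like. It is not from the original answer. It assumes the timestamps are set as the index (named ts here, an assumption) and that six-hour bins aligned to midnight are wanted, which is how resample("6h") behaves by default. The two readings on 2015-10-06 are made up for illustration.

```python
import pandas as pd

# First four rows are from the question; the last two are made-up
# readings on a second day, added only for illustration.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2015-10-05 00:01:38",
        "2015-10-05 06:03:39",
        "2015-10-05 13:42:39",
        "2015-10-05 22:15:39",
        "2015-10-06 01:10:00",
        "2015-10-06 07:30:00",
    ]),
    "X": [1, 4, 3, 2, 3, 6],
}).set_index("ts")

# resample needs a datetime index; this gives one mean per
# six-hour slot per calendar day (NaN for slots with no readings).
per_day = df["X"].resample("6h").mean()

# Collapse the days by grouping on the starting hour of each slot
# (0, 6, 12, 18) and averaging the per-day means.
by_slot = per_day.groupby(per_day.index.hour).mean()
print(by_slot)
```

Note that this averages the per-day means, so each day carries equal weight. That differs from pooling every observation in a slot when the number of readings varies from day to day, which is one reason the hour-of-day column may be the simpler route.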

Is this along the lines of what you were wanting?
