Performance issue turning rows with start - end into a dataframe with TimeIndex

Problem description

I have a large dataset where each line represents the value of a certain type (think a sensor) for a time interval (between start and end). It looks like this:

    start       end    type value
2015-01-01  2015-01-05  1   3
2015-01-06  2015-01-08  1   2
2015-01-05  2015-01-08  3   3
2015-01-13  2015-01-16  2   1
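
For following along with this small example, a minimal sketch of how the sample frame could be constructed (named df so the snippets below can be run on it directly; start and end are assumed to be parsed as datetimes):

import pandas as pd

# Hypothetical sample frame matching the table above
df = pd.DataFrame({
    'start': pd.to_datetime(['2015-01-01', '2015-01-06', '2015-01-05', '2015-01-13']),
    'end':   pd.to_datetime(['2015-01-05', '2015-01-08', '2015-01-08', '2015-01-16']),
    'type':  [1, 1, 3, 2],
    'value': [3, 2, 3, 1],
})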

I want to turn it into a daily time-indexed frame like this:

day       type  value
2015-01-01  1   3
2015-01-02  1   3
2015-01-03  1   3
2015-01-04  1   3
2015-01-05  1   3
2015-01-06  1   2
2015-01-07  1   2
2015-01-08  1   2
2015-01-05  3   3
2015-01-06  3   3
2015-01-07  3   3
2015-01-08  3   3
2015-01-13  2   1
2015-01-14  2   1
2015-01-15  2   1
2015-01-16  2   1

(Note that we cannot make any assumptions regarding the intervals: they should be contiguous and non-overlapping, but we cannot guarantee that.)

Based on these Stack Overflow answers [1] (DataFrame resample on date ranges) and [2] (pandas: Aggregate based on start/end date), there seem to be two methods: one built around itertuples, and one around melt ([2] actually used stack/unstack, but it is similar to melt). Let us compare them for performance.

import datetime as dt

import numpy as np
import pandas as pd

# Creating a big enough dataframe: 49 types, each with one 4-day interval
# every 4 days over 5 years
date_range = pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2019, 12, 31), freq='4D')
to_concat = []
for val in range(1, 50):
    frame_tmp = pd.DataFrame()
    frame_tmp['start'] = date_range
    frame_tmp['end'] = frame_tmp['start'] + dt.timedelta(3)
    frame_tmp['type'] = val
    frame_tmp['value'] = np.random.randint(1, 6, frame_tmp.shape[0])
    to_concat.append(frame_tmp)
df = pd.concat(to_concat, ignore_index=True)

# Method 1: expand each interval into a daily Series with itertuples +
# pd.date_range, then join the expanded index back onto the original rows
def method_1(df):
    df1 = (pd.concat([pd.Series(r.Index,
                                pd.date_range(r.start,
                                              r.end,
                                              freq='D'))
                      for r in df.itertuples()])) \
        .reset_index()
    df1.columns = ['start_2', 'idx']

    df2 = df1.set_index('idx').join(df).reset_index(drop=True)

    return df2.set_index('start_2')

df_method_1 = df.groupby(['type']).apply(method_1)

# Method 2: melt the start/end columns into a single 'current_time' column,
# then resample each original row to daily frequency and forward-fill
df_tmp = df.reset_index()
df1 = (df_tmp.melt(df_tmp.columns.difference(['start', 'end']),
                   ['start', 'end'],
                   value_name='current_time')
       )
df_method_2 = df1.set_index('current_time').groupby('index', group_keys=False) \
    .resample('D').ffill()

With %%timeit in Jupyter, method 1 takes ~8s and method 2 takes ~25s for the dataframe defined as example. This is way too slow as the real dataset that I am dealing with is much bigger than this. On that dataframe, method 1 takes ~20 minutes.
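
The timings were taken with the %%timeit cell magic; a cell along these lines reproduces the measurement for method 1 (a sketch; method 2 goes in its own cell, and absolute numbers depend on hardware and pandas version):

%%timeit
# time method 1; run the method 2 pipeline in a separate %%timeit cell
df_method_1 = df.groupby(['type']).apply(method_1)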

Do you have any idea on how to make this faster?

Answer

This is about 1.7 times faster than your method_1 and a little tidier:

# Expand each interval into one record per day and build the frame in one pass
df_expand = pd.DataFrame.from_records(
    (
        (d, r.type, r.value)
        for r in df.itertuples()
        for d in pd.date_range(start=r.start, end=r.end, freq='D')
    ),
    columns=['day', 'type', 'value']
)
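
If you want the day-indexed layout shown in the question, a small follow-up (a sketch) is to sort and index on day afterwards:

# one row per day, ordered by type then day, indexed on day
df_daily = df_expand.sort_values(['type', 'day']).set_index('day')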

You can get about 7 times faster by creating your own date range instead of calling pd.date_range():

# Same idea, but generate the dates with plain timedelta arithmetic instead of
# calling pd.date_range() once per row
one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    (
        (r.start + i * one_day, r.type, r.value)
        for r in df.itertuples()
        for i in range(int((r.end - r.start) / one_day) + 1)
    ),
    columns=['day', 'type', 'value']
)

Or you can get up to 24 times faster by using numpy's arange function to generate the dates:

# Generate each row's dates with np.arange over datetime64[D] values
one_day = dt.timedelta(1)
df_expand = pd.DataFrame.from_records(
    (
        (d, r.type, r.value)
        for r in df.itertuples()
        for d in np.arange(r.start.date(), r.end.date() + one_day, dtype='datetime64[D]')
    ),
    columns=['day', 'type', 'value']
)
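
As an aside, this is what that np.arange call produces for a single interval; the stop value is exclusive, which is why one_day is added to the end date:

import datetime as dt
import numpy as np

np.arange(dt.date(2015, 1, 1), dt.date(2015, 1, 5) + dt.timedelta(1), dtype='datetime64[D]')
# array(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'],
#       dtype='datetime64[D]')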

I couldn't resist adding one more that is a little more than twice as fast as the last one. Unfortunately, it's a lot harder to read. This groups the readings based on how many days they span ('dur'), then uses vectorized numpy operations to expand each group in a single batch.

def expand_group(g):
    dur = g.dur.iloc[0] # how many days for each reading in this group?
    return pd.DataFrame({
        'day': (g.start.values[:,None] + np.timedelta64(1, 'D') * np.arange(dur)).ravel(),
        'type': np.repeat(g.type.values, dur),
        'value': np.repeat(g.value.values, dur),
    })
# take all readings with the same duration and process them together using vectorized code
df_expand = (
    df.assign(dur=(df['end']-df['start']).dt.days + 1)
    .groupby('dur').apply(expand_group)
    .reset_index('dur', drop=True)
)
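
One caveat worth noting: because the rows are processed one duration group at a time, df_expand comes back ordered by 'dur' rather than by the original row order. If the ordering matters, a sort restores it (a sketch):

# restore a type/day ordering after the per-duration expansion
df_expand = df_expand.sort_values(['type', 'day']).reset_index(drop=True)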

Update: Responding to your comment, below is a simplified version of the vectorized approach, which is faster and easier to read. Instead of using the groupby step, this makes a single matrix as wide as the longest reading, then filters out the unneeded entries. This should be pretty efficient unless the maximum duration for your readings is much longer than the average. With the test dataframe (all readings lasting 4 days), this is about 15x faster than the groupby solution and about 700x faster than method_1.

# Expand every reading to the maximum duration in one vectorized step...
dur = (df['end'] - df['start']).max().days + 1
df_expand = pd.DataFrame({
    'day': (
        df['start'].values[:, None] + np.timedelta64(1, 'D') * np.arange(dur)
    ).ravel(),
    'type': np.repeat(df['type'].values, dur),
    'value': np.repeat(df['value'].values, dur),
    'end': np.repeat(df['end'].values, dur),
})
# ... then drop the rows that fall after each reading's actual end date
df_expand = df_expand.loc[df_expand['day'] <= df_expand['end'], 'day':'value']
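
As a quick sanity check (a sketch, assuming df_expand_itertuples holds the output of the first from_records snippet above), the approaches should produce the same rows up to ordering and dtype details:

from pandas.testing import assert_frame_equal

def normalize(frame):
    # sort rows and coerce 'day' to datetime64 so the comparison ignores ordering/dtype details
    out = frame.sort_values(['type', 'day']).reset_index(drop=True)
    return out.assign(day=pd.to_datetime(out['day']))

assert_frame_equal(normalize(df_expand), normalize(df_expand_itertuples), check_dtype=False)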
