Expanding a dataframe based on start and end columns (speed)

Problem description

I have a pandas.DataFrame containing start and end columns, plus a couple of additional columns. I would like to expand this dataframe into a time series that runs from each row's start value to its end value (business days only), copying the other columns onto every generated row. So far I have come up with the following:

import pandas as pd
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

data_series = list()
for row in df.itertuples():
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))

columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)

Starting dataframe:

       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-05 2017-04-12      EU      b
2 2017-04-10 2017-04-17      UK      c

Desired output:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c

The problem with my solution is that it is not fast enough when applied to a much bigger dataframe (bigger mostly in terms of rows). Does anybody have ideas for how I could improve it? I am also open to solutions in numpy.

Recommended answer

First, build the dates you need, while keeping track of the number of days in each row via the list deltas:

dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')
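
As a small illustration of why deltas is needed, the sketch below shows that pd.bdate_range skips weekends, so the number of generated dates depends on which weekdays the range covers and cannot be read off the calendar difference alone. The variable name rng is only illustrative.

```python
import pandas as pd

# Business-day range for the second row of the example (EU, 2017-04-05 to 2017-04-12).
rng = pd.bdate_range("2017-04-05", "2017-04-12")

# The weekend of 8-9 April is skipped, so 8 calendar days yield 6 business days.
print(list(rng.strftime("%Y-%m-%d")))
print(len(rng))  # 6
```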

Then use np.repeat to build up the data matrix with the proper segment lengths:

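df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
# df.values is an object array here, so restore the datetime dtypes explicitly;
# pandas 2.x rejects the unit-less "datetime64", hence "datetime64[ns]".
df2 = df2.astype(dtype={"start": "datetime64[ns]", "end": "datetime64[ns]"})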
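
To make the role of np.repeat concrete, here is a minimal sketch on a toy array. When given a list of counts and axis=0, np.repeat repeats each row the corresponding number of times. The names values, counts and repeated are illustrative and not part of the answer's code.

```python
import numpy as np

# One row per original record; object dtype mirrors what df.values returns
# for a dataframe with mixed column types.
values = np.array([["US", "a"], ["EU", "b"], ["UK", "c"]], dtype=object)

# Hypothetical segment lengths: row 0 twice, row 1 once, row 2 three times.
counts = [2, 1, 3]

repeated = np.repeat(values, counts, axis=0)
print(repeated.shape)         # (6, 2)
print(repeated[:, 0].tolist())
```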

Then insert the dates at the front of the dataframe:

df2 = pd.concat([dates, df2], axis=1)

Test code:

import pandas as pd
import numpy as np
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5),
               dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12),
             dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')

df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64[ns]", "end": "datetime64[ns]"})
df2 = pd.concat([dates, df2], axis=1)
print(df2)

Result:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c
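
Since the question also asks about numpy, the following is a hedged sketch of a variant that avoids the per-row Python loop entirely. It is not the answer's method: it swaps pd.bdate_range for np.busday_count and np.busday_offset. The function name expand_business_days is hypothetical, and the sketch assumes a Monday-to-Friday week with no holidays (the pd.bdate_range defaults) and end >= start on every row. Whether it is actually faster on a given dataset would need to be measured.

```python
import numpy as np
import pandas as pd


def expand_business_days(df, start_col="start", end_col="end"):
    """Repeat each row once per business day in [start, end] (hypothetical helper)."""
    start = df[start_col].to_numpy().astype("datetime64[D]")
    end = df[end_col].to_numpy().astype("datetime64[D]")

    # busday_count covers the half-open range [begin, end), so shift the end
    # by one day to make the end date inclusive.
    counts = np.busday_count(start, end + np.timedelta64(1, "D"))

    # Index of the source row for every output row.
    row_idx = np.repeat(np.arange(len(df)), counts)

    # Position of each output row within its own segment: 0, 1, 2, ...
    first = np.cumsum(counts) - counts
    offsets = np.arange(counts.sum()) - np.repeat(first, counts)

    # roll="forward" moves a weekend start to the next business day
    # before the offset is applied.
    dates = np.busday_offset(start[row_idx], offsets, roll="forward")

    out = df.iloc[row_idx].reset_index(drop=True)
    out.insert(0, "date", dates.astype("datetime64[ns]"))
    return out


df = pd.DataFrame({
    "start": pd.to_datetime(["2017-04-03", "2017-04-05", "2017-04-10"]),
    "end": pd.to_datetime(["2017-04-10", "2017-04-12", "2017-04-17"]),
    "country": ["US", "EU", "UK"],
    "letter": ["a", "b", "c"],
})
out = expand_business_days(df)
print(out)
```

Because the rows are selected with df.iloc rather than rebuilt from df.values, the start and end columns keep their datetime dtype and no astype step is needed.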
