使用“列表列"中的数据转换pandas df成一个长格式的时间序列.使用三列:[数据列表] + [时间戳] + [持续时间] [英] Convert pandas df with data in a "list column" into a time series in long format. Use three columns: [list of data] + [timestamp] + [duration]

查看:50
本文介绍了使用“列表列"中的数据转换pandas df成一个长格式的时间序列.使用三列:[数据列表] + [时间戳] + [持续时间]的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

目的是将具有列表列作为数据列(因此每行只有一个时间戳和持续时间)的数据帧转换为长格式的时间序列,每个项目都有一个日期时间索引.

在结果中,数据的每行不再有序列/列表,只有一个值列.

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 14623000000]'列表': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],'duration_sec': [3.0, 3.0, 3.0, 3.0]})tdi = pd.DatetimeIndex(df_test.timestamp)df_test.set_index(tdi,就地=真)df_test.drop(columns='timestamp', inplace=True)df_test.index.name = '日期时间索引'

出:

 列表 duration_sec日期时间索引2016-05-04 08:53:20 [1, 2, 1, 9] 3.02016-05-04 08:55:00 [2, 2, 3, 0] 3.02016-05-04 08:56:40 [1, 3, 3, 0] 3.02016-05-04 08:58:20 [1, 1, 3, 9] 3.0

目标是:

 值日期时间索引2016-05-04 08:53:20 12016-05-04 08:53:21 22016-05-04 08:53:22 12016-05-04 08:53:23 92016-05-04 08:55:00 22016-05-04 08:55:01 22016-05-04 08:55:02 32016-05-04 08:55:03 02016-05-04 08:56:40 12016-05-04 08:56:41 32016-05-04 08:56:42 32016-05-04 08:56:43 02016-05-04 08:58:20 12016-05-04 08:58:21 12016-05-04 08:58:22 32016-05-04 08:58:23 9

请注意,这意味着不仅仅是为每个项目花费 1 秒,这只是为了简化示例.相反,它是一个序列中的大约 4 个项目,它们具有例如 3.0 秒的给定持续时间(也可能因行而异),并且其中每个序列的第一个项目总是从时间 0"开始.这意味着每个项目的秒数应计算为 [3.0 秒/(4-1) 个项目] = 1 秒.

####

上下文:

该示例显示了对 Datetimeindex 的转换,因为这使其适用于 season_decompose(),请参见

现在上面的 df_test 也需要同样的东西.

解决方案

使用

The aim is to convert a dataframe with a list column as the data column (and thus with just one timestamp and duration per row) into a time series in long format with a datetimeindex for each single item.

In the result, there is no sequence / list per row for the data anymore, but just one value column.

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 1462352300000000000],
                'list': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],
                'duration_sec': [3.0, 3.0, 3.0, 3.0]})

tdi = pd.DatetimeIndex(df_test.timestamp)
df_test.set_index(tdi, inplace=True)
df_test.drop(columns='timestamp', inplace=True)
df_test.index.name = 'datetimeindex'

Out:

                             list  duration_sec
datetimeindex                                      
2016-05-04 08:53:20  [1, 2, 1, 9]           3.0
2016-05-04 08:55:00  [2, 2, 3, 0]           3.0
2016-05-04 08:56:40  [1, 3, 3, 0]           3.0
2016-05-04 08:58:20  [1, 1, 3, 9]           3.0

The aim is:

                     value
datetimeindex
2016-05-04 08:53:20  1
2016-05-04 08:53:21  2
2016-05-04 08:53:22  1
2016-05-04 08:53:23  9
2016-05-04 08:55:00  2
2016-05-04 08:55:01  2
2016-05-04 08:55:02  3
2016-05-04 08:55:03  0
2016-05-04 08:56:40  1
2016-05-04 08:56:41  3
2016-05-04 08:56:42  3
2016-05-04 08:56:43  0
2016-05-04 08:58:20  1
2016-05-04 08:58:21  1
2016-05-04 08:58:22  3
2016-05-04 08:58:23  9

Mind that this means not just to take 1 second for each item, this was just taken to simplify the example. Instead, it is about 4 items in a sequence which have a given duration of for example 3.0 seconds (which may also vary from row to row), and where the first item of each sequence always starts at "time 0" meaning that the seconds per item should be calculated like [3.0 sec / (4-1) items] = 1 sec.

####

Context:

The example shows a conversion to Datetimeindex since this makes it suitable for seasonal_decompose(), see https://www.machinelearningplus.com/time-series/time-series-analysis-python/ first search hit.

There, the resulting df looks like this:

df_test2 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

Out:

                value
date                 
1991-07-01   3.526591
1991-08-01   3.180891
1991-09-01   3.252221
1991-10-01   3.611003
1991-11-01   3.565869
              ...
2008-02-01  21.654285
2008-03-01  18.264945
2008-04-01  23.107677
2008-05-01  22.912510
2008-06-01  19.431740

[204 rows x 1 columns]

And then it is easy to apply a seaonal_decompose() --> Additive Decomposition:

result_add = seasonal_decompose(df_test2['value'], model='additive', extrapolate_trend='freq')

Plot:

plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

Now the same is needed for the df_test above.

解决方案

Use DataFrame.explode first and then add counter by GroupBy.cumcount and to_timedelta to df.index:

df_test = df_test.explode('nestedList')
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='s')

print (df_test)
                    nestedList  duration_sec
2016-05-04 08:53:20          1           3.0
2016-05-04 08:53:21          2           3.0
2016-05-04 08:53:22          1           3.0
2016-05-04 08:53:23          9           3.0
2016-05-04 08:55:00          2           3.0
2016-05-04 08:55:01          2           3.0
2016-05-04 08:55:02          3           3.0
2016-05-04 08:55:03          0           3.0
2016-05-04 08:56:40          1           3.0
2016-05-04 08:56:41          3           3.0
2016-05-04 08:56:42          3           3.0
2016-05-04 08:56:43          0           3.0
2016-05-04 08:58:20          1           3.0
2016-05-04 08:58:21          1           3.0
2016-05-04 08:58:22          3           3.0
2016-05-04 08:58:23          9           3.0

EDIT:

df_test = df_test.explode('nestedList') 
sizes = df_test.groupby(level=0)['nestedList'].transform('size').sub(1)
duration = df_test['duration_sec'].div(sizes) 
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * duration, unit='s') 

EDIT2 by asker:

With the resulting df this simple application of decompose() is now possible, which was the final aim:

result_add = seasonal_decompose(x=df_test['nestedList'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

这篇关于使用“列表列"中的数据转换pandas df成一个长格式的时间序列.使用三列:[数据列表] + [时间戳] + [持续时间]的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆