Split or merge actions by date


Problem description

I would like to build a sequence database from different activities (the ACT columns), which may fall on the same date or on different dates. As the table below shows, some rows contain missing dates (NaT). I need the final data to train a machine learning model on sequences of activities.

ID  ACT1        ACT2        ACT3        ACT4        ACT5    
0   2015-08-11  2015-08-16  2015-08-16  2015-09-22  2015-08-19
1   2014-07-16  2014-07-16  2014-09-16  NaT         2014-09-12
2   2016-07-16  NaT         2017-09-16  2017-09-16  2017-12-16

The expected output splits activities that fall on different dates into separate sequence columns and merges activities that share a date into one column, as in the following table:

ID Sequence1  Sequence2  Sequence3  Sequence4  
0  ACT1       ACT2,ACT3  ACT5       ACT4
1  ACT1,ACT2  ACT5       ACT3
2  ACT1       ACT3,ACT4  ACT5
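
To make the split/merge rule concrete, here is a minimal plain-Python sketch for a single row (ID 1 from the table). The helper name `to_sequences` and the use of `None` in place of NaT are assumptions made for illustration only; they are not part of the original question.

```python
from itertools import groupby

# One row of the input table: activity -> ISO date string (None stands in for NaT)
row = {
    "ACT1": "2014-07-16",
    "ACT2": "2014-07-16",
    "ACT3": "2014-09-16",
    "ACT4": None,
    "ACT5": "2014-09-12",
}

def to_sequences(row):
    """Merge activities that share a date, split the rest, ordered by date."""
    # Drop missing dates, then sort by (date, activity name)
    dated = sorted((d, a) for a, d in row.items() if d is not None)
    # Consecutive entries with the same date form one sequence step
    return [",".join(a for _, a in grp)
            for _, grp in groupby(dated, key=lambda pair: pair[0])]

print(to_sequences(row))
```

Activities dated 2014-07-16 end up in the same step, while the two September dates each get their own step.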

The following script only outputs a single string containing the whole sequence:

df['Sequence'] = df.loc[:, cols].apply(lambda dr: ','.join(df.loc[:, cols].columns[dr.dropna().argsort()]), axis=1)

Sequence
ACT1,ACT2,ACT3,ACT5,ACT4
ACT1,ACT2,ACT5,ACT3
ACT1,ACT3,ACT4,ACT5
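
The one-liner above depends on `df` and `cols`, which the question does not show. The following self-contained sketch reproduces the flat sequence under the assumption that the ACT columns hold datetimes. It sorts each row by value and reads the activity names from the row's index, which avoids positional indexing into the column list after `dropna()` (positions in the shortened row may no longer line up with the full column list). The helper name `flat_sequence` is hypothetical.

```python
import pandas as pd

# Sample data from the question; None becomes NaT
df = pd.DataFrame({
    "ACT1": pd.to_datetime(["2015-08-11", "2014-07-16", "2016-07-16"]),
    "ACT2": pd.to_datetime(["2015-08-16", "2014-07-16", None]),
    "ACT3": pd.to_datetime(["2015-08-16", "2014-09-16", "2017-09-16"]),
    "ACT4": pd.to_datetime(["2015-09-22", None, "2017-09-16"]),
    "ACT5": pd.to_datetime(["2015-08-19", "2014-09-12", "2017-12-16"]),
})
cols = ["ACT1", "ACT2", "ACT3", "ACT4", "ACT5"]

def flat_sequence(row):
    # Drop NaT, sort by date (stable, so ties keep column order),
    # then join the activity names taken from the index
    ordered = row.dropna().sort_values(kind="stable")
    return ",".join(ordered.index)

flat = df[cols].apply(flat_sequence, axis=1)
print(flat.tolist())
```

This still yields only one string per row; it does not yet group activities that share a date.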

Recommended answer

This was challenging, but I believe this will work for you.

from collections import defaultdict
import numpy as np  # needed for the np.nan placeholders below
import pandas as pd

data = {
      'ACT1': [pd.Timestamp(year=2015, month=8, day=11),
               pd.Timestamp(year=2014, month=7, day=16),
               pd.Timestamp(year=2016, month=7, day=16)],
      'ACT2': [pd.Timestamp(year=2015, month=8, day=16),
               pd.Timestamp(year=2014, month=7, day=16),
               np.nan],
      'ACT3': [pd.Timestamp(year=2015, month=8, day=16),
               pd.Timestamp(year=2014, month=9, day=16),
               pd.Timestamp(year=2017, month=9, day=16)],
      'ACT4': [pd.Timestamp(year=2015, month=9, day=22),
               np.nan, 
               pd.Timestamp(year=2017, month=9, day=16)],
      'ACT5': [pd.Timestamp(year=2015, month=8, day=19),
               pd.Timestamp(year=2014, month=9, day=12),
               pd.Timestamp(year=2017, month=12, day=16)]}

df = pd.DataFrame(data)

# Unstack so we can create groups
unstacked = df.unstack().reset_index()

# This will keep track of our sequence data
sequences = defaultdict(list)

# Here we get our groups, e.g., 'ACT1,ACT2', etc.;
# We group by date first, then by original index (0,1,2)
for i, g in unstacked.groupby([0, 'level_1']):
    sequences[i[1]].append(','.join(g.level_0))

# How many sequences (columns) we're going to need
n_seq = len(max(sequences.values(), key=len))

# Any NaTs will always shift your data to the left,
# so to speak, so we need to right pad the rows 
for k in sequences:
    while len(sequences[k]) < n_seq:
        sequences[k].append('')

# Create column labels and make new dataframe
columns = ['Sequence{}'.format(i) for i in range(1, n_seq + 1)]
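# Build the rows in index order: on Python 3 the dict keeps insertion order,
# which follows the date-sorted groupby rather than the original index
ids = sorted(sequences)
print(pd.DataFrame([sequences[k] for k in ids], index=ids, columns=columns))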

   Sequence1  Sequence2 Sequence3 Sequence4
0       ACT1  ACT2,ACT3      ACT5      ACT4
1  ACT1,ACT2       ACT5      ACT3          
2       ACT1  ACT3,ACT4      ACT5   
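
For comparison, the same table can probably be produced without the explicit loop and padding. The sketch below is an alternative to the answer's approach, not part of it: it reshapes to long format with `melt`, joins activities per (ID, date) group, numbers the groups per ID with `cumcount`, and pivots back to wide format. The names `long`, `steps`, `seq`, and `wide` are chosen here for illustration.

```python
import pandas as pd

# Sample data from the question; None becomes NaT
df = pd.DataFrame({
    "ACT1": pd.to_datetime(["2015-08-11", "2014-07-16", "2016-07-16"]),
    "ACT2": pd.to_datetime(["2015-08-16", "2014-07-16", None]),
    "ACT3": pd.to_datetime(["2015-08-16", "2014-09-16", "2017-09-16"]),
    "ACT4": pd.to_datetime(["2015-09-22", None, "2017-09-16"]),
    "ACT5": pd.to_datetime(["2015-08-19", "2014-09-12", "2017-12-16"]),
})

# Long format: one row per (ID, activity), with missing dates removed
long = (
    df.rename_axis("ID")
      .reset_index()
      .melt(id_vars=["ID"], var_name="ACT", value_name="date")
      .dropna(subset=["date"])
)

# Merge activities that share an ID and a date; groups come out sorted by ID, then date
steps = long.groupby(["ID", "date"])["ACT"].agg(",".join).reset_index()

# Number the steps within each ID: 1, 2, 3, ...
steps["seq"] = steps.groupby("ID").cumcount() + 1

# Back to wide format; IDs with fewer steps get empty strings on the right
wide = steps.pivot(index="ID", columns="seq", values="ACT")
wide.columns = ["Sequence{}".format(i) for i in wide.columns]
wide = wide.fillna("")
print(wide)
```

Because `pivot` fills missing steps automatically, no manual right-padding is needed, and the result keeps the original ID as its index.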
