Coding variables with Pandas TimeSeries


Problem description

As a follow-up to something I was struggling with in a previous question, I've been working for a long time on an analysis in Pandas of some pretty complicated behavioural data from a mouse-tracking experiment.

A relevant subset of my data looks like this:

data.iloc[0]

time_stamp                                     21/11/2013 13:06
subject                                                 1276270
trial                                                         0
stimuli                                                      14
resp                                                          2
rt                                                         1145
x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
Name: 0, dtype: object

where x, y, and t are 1D numpy arrays of mouse coordinates and timestamps.

I wanted to use Pandas' considerable resources for time series data to transform and analyse these coordinates as TimeSeries objects. I have no problem converting them to TimeSeries objects (rx and ry, each with an index generated by interpolating the timestamps into 20 msec intervals).

data.rx.iloc[0]

0     -0
20     0
40     0
60     0
80     0
100    0
120    0
140    0
160    0
180    0
200    0
220    0
240    0
260    0
280    0
...
2720    1
2740    1
2760    1
2780    1
2800    1
2820    1
2840    1
2860    1
2880    1
2900    1
2920    1
2940    1
2960    1
2980    1
3000    1
Length: 151, dtype: float64
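
The resampling step described above can be sketched roughly as follows. This is only an illustration, assuming plain linear interpolation with `numpy.interp`; the helper name `to_even_steps` and the sample values are made up, and this is not necessarily how rx and ry were actually produced.

```python
import numpy as np
import pandas as pd

def to_even_steps(values, t, step=20, t_max=3000):
    """Linearly interpolate one trajectory onto a regular time grid (hypothetical helper)."""
    grid = np.arange(0, t_max + step, step)  # 0, 20, ..., t_max
    # Times outside the recorded range are clamped to the first/last sample.
    return pd.Series(np.interp(grid, t, values), index=grid)

# Made-up trajectory: x moves from 0 to 1 between 500 ms and 1000 ms
t = np.array([1, 26, 500, 1000, 1145])
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
rx = to_even_steps(x, t)
print(rx.head())
```

With a 20 msec step and a 3000 msec window, the result has 151 points, which matches the length of the series shown above.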

However, this approach, with 2 TimeSeries nested on each row of the DataFrame, definitely isn't idiomatic (see this question); although I have been able to do quite a bit with it, I feel I'm going against Pandas, and making life difficult for myself.

The proper approach, I think, would be to either store rx and ry as independent data structures, or add 302 columns to my existing data, one for each time step in rx and ry.

The problem with the first approach is that I have no way of accessing my categorical data (i.e. the subject, stimuli, and resp columns, amongst others I've left out here), while the problem with the second is that I end up with a DataFrame thousands of columns wide (and wider again for each transformation I apply: velocity at each step, angle at each step, etc.), and no useful way of accessing specific time series (i.e. what I currently call as data.rx.mean().plot()).

All of this is really just preamble to my question, which is this:

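Does Pandas, or any other Python library, provide a way of dealing with large amounts of time series data while retaining the coding data that accompanies them?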

Thanks,

Eoin

Recommended answer

I've been asked via email if I ever found a solution to what I wanted to do here, so I'm sharing what I've been doing to date. This might not be the canonical way of using pandas, but it's sufficed for me.

In short, I've split my data into a couple of data frames. The first, data, is as above, but I only use the columns which correspond to single values, like trial, stimuli, resp, and rt.

For my time series data, I use two additional data frames, one for the x-coordinate data, and one for the y. Although there's probably a more elegant way of generating these, my code does the following.

data.iloc[0]

    time_stamp                                     21/11/2013 13:06
    subject                                                 1276270
    trial                                                         0
    stimuli                                                      14
    resp                                                          2
    rt                                                         1145
    x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
    y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
    t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
    Name: 0, dtype: object

import pandas as pd

# even_time_steps comes from the squeak package:
# https://github.com/EoinTravers/Squeak
data['nx'], data['ny'] = zip(*
     [even_time_steps(x, y, t)
     for x, y, t in zip(data.x, data.y, data.t)])
# Simpler applications could use the following
# (pd.TimeSeries has since been removed from pandas; pd.Series replaces it)
# data['nx'] = [pd.Series(x) for x in data['x']]
# data['ny'] = [pd.Series(y) for y in data['y']]

# Separate DataFrames: one row per trial, one column per time step
nx = pd.concat(list(data.nx), axis=1).T
ny = pd.concat(list(data.ny), axis=1).T

# Remove redundant columns
redundant = ['nx', 'ny', 'x', 'y'] # etc...
data = data.drop(redundant, axis=1)

# Important - reindex data
data.index = range(len(data)) # 0, 1, 2, ..., len(data)

Now data contains all my coding information, nx all my x-coordinate information, and ny all my y-coordinate information.

nx.head()

       0    1    2    3    4    5    6    7    8        9     ...          91
    0    0    0    0    0    0    0    0    0    0  0.00000   ...     0.953960   
    1    0    0    0    0    0    0    0    0    0  0.00099   ...     1.000000   
    2    0    0    0    0    0    0    0    0    0  0.00000   ...     1.010000   
    3    0    0    0    0    0    0    0    0    0  0.00000   ...     0.870396   
    4    0    0    0    0    0    0    0    0    0  0.00000   ...     1.000000   

             92        93        94       95        96   97   98   99   100  
    0  0.993564  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    1  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    2  1.010000  1.008812  1.003960  1.00000  1.000000    1    1    1    1  
    3  0.906238  0.936931  0.973564  0.98604  0.993366    1    1    1    1  
    4  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  

    [5 rows x 101 columns]
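
For readers who want to try the split without the original data, here is a minimal self-contained sketch on a toy data set. The values and the three-step trajectories are invented, and `pd.Series` stands in for `even_time_steps`, so treat it as an illustration of the layout rather than the actual pipeline.

```python
import pandas as pd

# Toy stand-in for `data`: two coding columns plus one raw trajectory per trial
data = pd.DataFrame({
    'subject': [1, 1, 2],
    'resp': [2, 1, 2],
    'x': [[0.0, 0.5, 1.0], [0.0, 0.4, 1.0], [0.0, -0.5, -1.0]],
})

# One Series per trial, concatenated as columns, then transposed:
# the result has one row per trial and one column per time step.
nx = pd.concat([pd.Series(x) for x in data['x']], axis=1).T

# Keep only the single-valued columns and make sure both frames
# share the same 0..n-1 index, so rows can be matched later.
data = data.drop(['x'], axis=1).reset_index(drop=True)
nx = nx.reset_index(drop=True)

print(nx)
```

The only link between the two frames is the shared row index, which is why the reindexing step matters.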

Finally, to select specific subsets of the x and y data, according to coding variables stored in data, I just take the index of the relevant subset of data

subject1_index = data[data.subject==1].index
print(subject1_index)

    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    18, 19, 20,  21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
    36, 37, 38, 39], dtype='int64')

and select a matching subset of nx and ny using the iloc method. This works because data was reindexed to 0, 1, 2, ..., so its index labels coincide with row positions in nx and ny.

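import matplotlib.pyplot as plt

sub1_x = nx.iloc[subject1_index]
sub1_y = ny.iloc[subject1_index]
# Plot each trial's trajectory faintly, then the mean trajectory on top
for i in subject1_index:
    plt.plot(nx.iloc[i], ny.iloc[i], 'r', alpha=.3)
plt.plot(sub1_x.mean(), sub1_y.mean(), 'r', linewidth=2)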
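
The selection step can be reproduced on toy data as sketched below. The numbers are invented, and the sketch uses `loc` rather than `iloc`: since the index taken from data holds labels, label-based lookup keeps working even if the frames are not indexed 0, 1, 2, ..., as long as both share the same labels.

```python
import pandas as pd

# Toy coding data and matching x-coordinates (one row per trial)
data = pd.DataFrame({'subject': [1, 1, 2], 'resp': [2, 1, 2]})
nx = pd.DataFrame([[0.0, 0.5, 1.0],
                   [0.0, 0.3, 1.0],
                   [0.0, -0.5, -1.0]])

# Row labels of the trials belonging to subject 1
subject1_index = data[data.subject == 1].index

sub1_x = nx.loc[subject1_index]   # label-based selection of the same trials
mean_x = sub1_x.mean()            # average x position at each time step
print(mean_x)
```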

EDIT: For completeness, note that a lot of my analysis requires long format data (and is carried out in R). Again, there may be a more elegant way of doing this (so use at your own risk!), but my code goes as follows (note, this is real code, from a different dataset, and I haven't bothered to change the variable names to match the original example):

import os
import pandas as pd

# Long format data: first build a wide frame with one column per time step
wide_data = data.copy()
steps = nx.columns
for i in steps:
    wide_data['nx_%i' % i] = nx[i]
    wide_data['ny_%i' % i] = ny[i]

id_vars = ['subject_nr', 'condition', 'count_trial_sequence',
    'trial_id', 'choice', 'accuracy']

# Long data with 'nx' as the variable
long_data = pd.melt(wide_data, id_vars=id_vars, value_vars = ['nx_%i' % i for i in steps])
long_data['step'] = long_data.variable.map(lambda s: int(s[3:]))
long_data['nx'] = long_data.value

# Same with 'ny'
tmp_long = pd.melt(wide_data, id_vars=id_vars, value_vars = ['ny_%i' % i for i in steps])
# Combine in single data frame
long_data['ny'] = tmp_long['value']
del tmp_long

long_data = long_data.drop(['variable', 'value'], axis=1)
long_data.to_csv(os.path.join('data', 'long_data.csv'))

long_data.head()
Out[41]: 
       subject_nr      condition  count_trial_sequence  trial_id choice accuracy  
    0   505250022              A                     0        13   rsp1     True   
    1   505250022              A                     1        16   rsp1     True   
    2   505250022              B                     2         2   rsp2    False   
    3   505250022              B                     3         0   rsp1    False   
    4   505250022              C                     4        33   rsp2    False   

       step  nx  ny  
    0     0   0   0  
    1     0   0   0  
    2     0   0   0  
    3     0   0   0  
    4     0   0   0  
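
As a possible alternative to building the intermediate wide frame, the same long layout can be reached by melting nx and ny directly and merging the coding variables back in on the row index. This is a sketch on invented toy data, not the code used in the answer; the helper `to_long` and the column name `trial_row` are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({'subject': [1, 2], 'resp': [2, 1]})
nx = pd.DataFrame([[0.0, 0.5, 1.0], [0.0, -0.5, -1.0]])
ny = pd.DataFrame([[0.0, 0.6, 1.5], [0.0, 0.7, 1.5]])

def to_long(wide, value_name):
    """Turn a trials-by-steps frame into one row per (trial_row, step) pair."""
    return (wide.rename_axis('trial_row')
                .reset_index()
                .melt(id_vars='trial_row', var_name='step', value_name=value_name))

long_data = (to_long(nx, 'nx')
             .merge(to_long(ny, 'ny'), on=['trial_row', 'step'])
             # Attach coding variables by matching trial_row to data's index
             .merge(data, left_on='trial_row', right_index=True))
print(long_data)
```

Merging on both `trial_row` and `step` avoids relying on the two melted frames happening to come out in the same row order.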
