Coding variables with Pandas TimeSeries

Views: 122 | Published: 2020/5/18 23:39:33 | Tags: python numpy pandas time-series


Problem description

As a follow-up to something I was struggling with in a previous question, I've been working for a long time on an analysis, in Pandas, of some pretty complicated behavioural data from a mouse-tracking experiment.

A relevant subset of my data looks like this:

data.iloc[0]

time_stamp                                     21/11/2013 13:06
subject                                                 1276270
trial                                                         0
stimuli                                                      14
resp                                                          2
rt                                                         1145
x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
Name: 0, dtype: object

where x, y, and t are 1D numpy arrays of mouse coordinates and timestamps.

I wanted to use Pandas' considerable resources for time series data to transform and analyse these coordinates as TimeSeries objects. I have no problem converting them to TimeSeries objects (rx and ry, each with an index generated by interpolating the timestamps into 20 msec intervals).

data.rx.iloc[0]

0     -0
20     0
40     0
60     0
80     0
100    0
120    0
140    0
160    0
180    0
200    0
220    0
240    0
260    0
280    0
...
2720    1
2740    1
2760    1
2780    1
2800    1
2820    1
2840    1
2860    1
2880    1
2900    1
2920    1
2940    1
2960    1
2980    1
3000    1
Length: 151, dtype: float64
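
For readers who want to reproduce a series like rx, the following is a minimal sketch of resampling unevenly timed samples onto a 20 msec grid. It is an assumption about what the interpolation step could look like, not the code used in the question: resample_even is a hypothetical helper, it relies on linear interpolation with np.interp, and the toy values of x and t are made up.

```python
import numpy as np
import pandas as pd


def resample_even(values, t, step=20, t_max=3000):
    """Linearly interpolate samples taken at times t onto an even time grid.

    Hypothetical helper for illustration; t must be increasing.
    """
    grid = np.arange(0, t_max + step, step)
    # Outside the recorded range np.interp repeats the first/last sample.
    return pd.Series(np.interp(grid, t, values), index=grid)


# Made-up samples with uneven timestamps (in msec)
t = np.array([1, 26, 35, 45, 55, 65])
x = np.array([0.0, 0.0, 0.1, 0.2, 0.3, 0.4])

rx = resample_even(x, t, step=20, t_max=100)
print(rx)
```

With the default arguments (step=20, t_max=3000) the grid has 151 points, which matches the length of the series shown above.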

However, this approach, with 2 TimeSeries nested on each row of the DataFrame, definitely isn't idiomatic (see this question). Although I have been able to do quite a bit with it, I feel I'm going against Pandas and making life difficult for myself.

The proper approach, I think, would be to either store rx and ry as independent data structures, or add 302 columns to my existing data, one for each time step in rx and ry.

The problem with the first approach is that I have no way of accessing my categorical data (i.e. the subject, stimuli, and resp columns, amongst others I've left out here). The problem with the second is that I end up with a DataFrame thousands of columns wide (and wider again for each transformation I apply: velocity at each step, angle at each step, etc.), and no useful way of accessing specific time series (i.e. what I've currently been calling as data.rx.mean().plot()).

All of this is really just preamble to my question, which is this:

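Does Pandas, or any other Python library, provide a way of handling large amounts of time series data while retaining the coding data that accompanies them?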

Thanks,

Eoin

Recommended answer

I've been asked via email if I ever found a solution to what I wanted to do here, so I'm sharing what I've been doing to date. This might not be the canonical way of using pandas, but it's sufficed for me.

In short, I've split my data into a couple of data frames. The first, data, is as above, but I only use the columns which correspond to single values, like trial, stimuli, resp, and rt.

For my time series data, I use two additional data frames, one for the x-coordinate data and one for the y. Although there's probably a more elegant way of generating these, my code does the following.

data.iloc[0]

    time_stamp                                     21/11/2013 13:06
    subject                                                 1276270
    trial                                                         0
    stimuli                                                      14
    resp                                                          2
    rt                                                         1145
    x             [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
    y             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
    t             [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
    Name: 0, dtype: object

import pandas as pd

# Using function even_time_steps from package squeak
# https://github.com/EoinTravers/Squeak
data['nx'], data['ny'] = zip(*
     [even_time_steps(x, y, t)
      for x, y, t in zip(data.x, data.y, data.t)])
# Simpler applications could use the lines below.
# (pd.TimeSeries was an alias of pd.Series and has been removed
# from current pandas, so pd.Series is used here.)
# data['nx'] = [pd.Series(x) for x in data['x']]
# data['ny'] = [pd.Series(y) for y in data['y']]

# Separate DataFrames: one row per trial, one column per time step
nx = pd.concat(list(data.nx), axis=1).T
ny = pd.concat(list(data.ny), axis=1).T

# Remove redundant columns
redundant = ['nx', 'ny', 'x', 'y'] # etc...
data = data.drop(redundant, axis=1)

# Important - reindex data
data.index = range(len(data)) # 0, 1, 2, ..., len(data)

Now data contains all my coding information, nx all my x-coordinate information, and ny all my y-coordinate information.

nx.head()

       0    1    2    3    4    5    6    7    8        9     ...          91
    0    0    0    0    0    0    0    0    0    0  0.00000   ...     0.953960   
    1    0    0    0    0    0    0    0    0    0  0.00099   ...     1.000000   
    2    0    0    0    0    0    0    0    0    0  0.00000   ...     1.010000   
    3    0    0    0    0    0    0    0    0    0  0.00000   ...     0.870396   
    4    0    0    0    0    0    0    0    0    0  0.00000   ...     1.000000   

             92        93        94       95        96   97   98   99   100  
    0  0.993564  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    1  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  
    2  1.010000  1.008812  1.003960  1.00000  1.000000    1    1    1    1  
    3  0.906238  0.936931  0.973564  0.98604  0.993366    1    1    1    1  
    4  1.000000  1.000000  1.000000  1.00000  1.000000    1    1    1    1  

    [5 rows x 101 columns]
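
The split described above can be tried on toy data. This is a minimal sketch under stated assumptions rather than the original pipeline: the three trajectories, the column values, and the variable name trials are invented, and plain pd.Series objects stand in for the output of even_time_steps.

```python
import numpy as np
import pandas as pd

# Coding variables: one row per trial (made-up values)
data = pd.DataFrame({"subject": [1, 1, 2], "resp": [2, 1, 2]})

# One resampled x trajectory per trial, all on the same 5-step grid
trials = [
    pd.Series(np.linspace(0.0, 1.0, 5)),
    pd.Series(np.linspace(0.0, 0.8, 5)),
    pd.Series(np.linspace(0.0, -1.0, 5)),
]

# Concatenating side by side gives one column per trial;
# transposing turns that into one row per trial.
nx = pd.concat(trials, axis=1).T

# Give nx the same row labels as data so the two frames stay aligned
nx.index = data.index

print(nx.shape)
print(nx)
```

Keeping the two frames on a shared index is what makes it possible to use a selection made on data to pick rows out of nx.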

Finally, to select specific subsets of the x and y data according to the coding variables stored in data, I just take the index of the relevant subset of data

subject1_index = data[data.subject==1].index
print(subject1_index)

    Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
    18, 19, 20,  21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
    36, 37, 38, 39], dtype='int64')

and select a matching subset of nx and ny using the iloc method.

sub1_x = nx.iloc[subject1_index]
sub1_y = ny.iloc[subject1_index]
for i in subject1_index:
    plt.plot(nx.iloc[i], ny.iloc[i], 'r', alpha=.3)
plt.plot(sub1_x.mean(), sub1_y.mean(), 'r', linewidth=2)
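
Note that iloc selects by position, so the code above depends on data having been reindexed to 0, 1, 2, ... earlier. A label-based variant is sketched below as an alternative, assuming only that nx and data share the same index; the toy values are made up and plotting is left out so the example runs without matplotlib.

```python
import pandas as pd

# Made-up coding frame and matching trajectory frame (same index)
data = pd.DataFrame({"subject": [1, 1, 2, 2]})
nx = pd.DataFrame([
    [0.0, 0.5, 1.0],
    [0.0, 0.3, 1.0],
    [0.0, 0.9, 1.0],
    [0.0, 0.1, 1.0],
])

# Boolean mask built from the coding variables
mask = data["subject"] == 1

# .loc aligns the mask on index labels rather than row positions
sub1_x = nx.loc[mask]

# Mean trajectory across the selected trials, one value per time step
mean_path = sub1_x.mean()
print(mean_path)
```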

EDIT: For completeness, note that a lot of my analysis requires long format data (and is carried out in R). Again, there may be a more elegant way of doing this (so use at your own risk!), but my code is below. Note that this is real code from a different dataset, and I haven't bothered to change the variable names to match the original example:

import os

# Long format data: build a wide frame first, then melt it
wide_data = data.copy()
steps = nx.columns
for i in steps:
    wide_data['nx_%i' % i] = nx[i]
    wide_data['ny_%i' % i] = ny[i]

id_vars = ['subject_nr', 'condition', 'count_trial_sequence',
    'trial_id', 'choice', 'accuracy']

# Long data with 'nx' as the variable
long_data = pd.melt(wide_data, id_vars=id_vars, value_vars = ['nx_%i' % i for i in steps])
long_data['step'] = long_data.variable.map(lambda s: int(s[3:]))
long_data['nx'] = long_data.value

# Same with 'ny'
tmp_long = pd.melt(wide_data, id_vars=id_vars, value_vars = ['ny_%i' % i for i in steps])
# Combine in single data frame
long_data['ny'] = tmp_long['value']
del tmp_long

long_data = long_data.drop(['variable', 'value'], axis=1)
long_data.to_csv(os.path.join('data', 'long_data.csv'))

long_data.head()
Out[41]: 
       subject_nr      condition  count_trial_sequence  trial_id choice accuracy  
    0   505250022              A                     0        13   rsp1     True   
    1   505250022              A                     1        16   rsp1     True   
    2   505250022              B                     2         2   rsp2    False   
    3   505250022              B                     3         0   rsp1    False   
    4   505250022              C                     4        33   rsp2    False   

       step  nx  ny  
    0     0   0   0  
    1     0   0   0  
    2     0   0   0  
    3     0   0   0  
    4     0   0   0
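
One possible shorter route to the same long layout is sketched below. It is an alternative technique, not the author's code: it uses DataFrame.stack and merge instead of melt, the toy frames and the column names trial and step are assumptions made for the example, and the row order differs from the melt version (rows are grouped by trial rather than by step).

```python
import pandas as pd

# Made-up coding frame and matching wide trajectory frames
data = pd.DataFrame({"subject_nr": [1, 2], "choice": ["rsp1", "rsp2"]})
nx = pd.DataFrame([[0.0, 0.5, 1.0], [0.0, 0.2, 1.0]])
ny = pd.DataFrame([[0.0, 0.6, 1.5], [0.0, 0.4, 1.5]])

# stack() turns each wide frame into a Series indexed by (row, column),
# i.e. (trial, step); concat lines the two up as columns nx and ny.
long_xy = pd.concat({"nx": nx.stack(), "ny": ny.stack()}, axis=1)
long_xy.index.names = ["trial", "step"]

# Attach the coding variables by matching trial to the index of data
long_data = long_xy.reset_index().merge(
    data, left_on="trial", right_index=True
)
print(long_data)
```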
