Coding variables with Pandas TimeSeries
Question
As a follow up to something I was struggling with in a previous question, I've been working for a long time on an analysis of some pretty complicated behavioural data from a mouse-tracking experiment in Pandas.
A relevant subset of my data looks like this:
data.iloc[0]
time_stamp 21/11/2013 13:06
subject 1276270
trial 0
stimuli 14
resp 2
rt 1145
x [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
y [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
t [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
Name: 0, dtype: object
where x, y, and t are 1D numpy arrays of mouse coordinates and timestamps.
I wanted to use Pandas' considerable resources for time series data to transform and analyse these coordinates as TimeSeries objects. I have no problem converting them to TimeSeries objects (rx and ry, each with an index generated by interpolating the timestamps into 20 msec intervals).
data.rx.iloc[0]
0 -0
20 0
40 0
60 0
80 0
100 0
120 0
140 0
160 0
180 0
200 0
220 0
240 0
260 0
280 0
...
2720 1
2740 1
2760 1
2780 1
2800 1
2820 1
2840 1
2860 1
2880 1
2900 1
2920 1
2940 1
2960 1
2980 1
3000 1
Length: 151, dtype: float64
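For reference, here is one way that kind of 20 msec regridding can be sketched with numpy's np.interp. The resample_20ms helper below is hypothetical, standing in for whatever conversion code was actually used; it linearly interpolates the raw samples onto a fixed, evenly spaced grid:

```python
import numpy as np
import pandas as pd

def resample_20ms(x, t, step=20):
    # Interpolate raw samples x, recorded at times t (in msec),
    # onto an evenly spaced grid running from 0 to the last timestamp
    grid = np.arange(0, t[-1] + step, step)
    return pd.Series(np.interp(grid, t, x), index=grid)

# Toy data: four raw samples at irregular times
x = np.array([0.0, 0.1, 0.5, 1.0])
t = np.array([1, 40, 80, 120])
rx = resample_20ms(x, t)  # index: 0, 20, 40, ..., 120
```

Values before the first raw timestamp are clamped to the first sample, matching the flat runs of 0 and 1 at the ends of the series shown above.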
However, this approach, with two TimeSeries nested on each row of the DataFrame, definitely isn't idiomatic (see this question); although I have been able to do quite a bit with it, I feel I'm going against Pandas and making life difficult for myself.
The proper approach, I think, would be either to store rx and ry as independent data structures, or to add 302 columns to my existing data, one for each time step in rx and ry.
The problem with the first approach is that I have no way of accessing my categorical data (i.e. the subject, stimuli, and resp columns, amongst others I've left out here), while the problem with the second is that I end up with a DataFrame thousands of columns wide (and wider again for each transformation I apply: velocity at each step, angle at each step, etc.), and no useful way of accessing a specific time series (i.e. what I currently do by calling data.rx.mean().plot()).
All of this is really just preamble to my question, which is this:
Does Pandas, or any other python library, provide a way of handling a large number of time series, while preserving the coding data that goes with them?
Thanks,
Eoin
Answer
I've been asked via email if I ever found a solution to what I wanted to do here, so I'm sharing what I've been doing to date. This might not be the canonical way of using pandas, but it's sufficed for me.
In short, I've split my data into a couple of data frames. The first, data, is as above, but I only use the columns which correspond to single values, like trial, stimuli, resp, and rt.
For my time series data, I use two additional data frames, one for the x-coordinate data, and one for the y. Although there's probably a more elegant way of generating these, my code does the following.
data.iloc[0]
time_stamp 21/11/2013 13:06
subject 1276270
trial 0
stimuli 14
resp 2
rt 1145
x [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0....
y [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
t [1, 26, 26, 35, 45, 55, 65, 75, 85, 95, 105, 1...
Name: 0, dtype: object
data['nx'], data['ny'] = zip(*[even_time_steps(x, y, t)
                               for x, y, t in zip(data.x, data.y, data.t)])
# Using function even_time_steps from package squeak
# https://github.com/EoinTravers/Squeak
# Simpler applications could use
# data['nx'] = [pd.TimeSeries(x) for x in data['x']]
# data['ny'] = [pd.TimeSeries(y) for y in data['y']]
# (pd.TimeSeries was an alias of pd.Series and has since been removed
#  from pandas; use pd.Series in recent versions)

# Separate DataFrames
nx = pd.concat(list(data.nx), axis=1).T
ny = pd.concat(list(data.ny), axis=1).T

# Remove redundant columns
redundant = ['nx', 'ny', 'x', 'y']  # etc...
data = data.drop(redundant, axis=1)

# Important - reindex data
data.index = range(len(data))  # 0, 1, 2, ..., len(data)-1
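Everything from this point on relies on data, nx, and ny sharing the same plain 0..n-1 row index, so after the reindexing step it can be worth asserting that alignment explicitly. A small sanity-check sketch, on hypothetical toy stand-ins for the three frames:

```python
import pandas as pd

# Hypothetical toy stand-ins for the three frames built above
data = pd.DataFrame({'subject': [1, 1, 2], 'resp': [2, 1, 2]})
nx = pd.DataFrame([[0.0, 0.5, 1.0]] * 3)
ny = pd.DataFrame([[0.0, 0.2, 0.9]] * 3)

# All three frames must have the same number of rows and identical indexes,
# otherwise row i of nx/ny no longer describes trial i of data
assert len(data) == len(nx) == len(ny)
assert (data.index == nx.index).all() and (data.index == ny.index).all()
```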
Now data contains all my coding information, nx all my x-coordinate information, and ny all my y-coordinate information.
nx.head()
0 1 2 3 4 5 6 7 8 9 ... 91
0 0 0 0 0 0 0 0 0 0 0.00000 ... 0.953960
1 0 0 0 0 0 0 0 0 0 0.00099 ... 1.000000
2 0 0 0 0 0 0 0 0 0 0.00000 ... 1.010000
3 0 0 0 0 0 0 0 0 0 0.00000 ... 0.870396
4 0 0 0 0 0 0 0 0 0 0.00000 ... 1.000000
92 93 94 95 96 97 98 99 100
0 0.993564 1.000000 1.000000 1.00000 1.000000 1 1 1 1
1 1.000000 1.000000 1.000000 1.00000 1.000000 1 1 1 1
2 1.010000 1.008812 1.003960 1.00000 1.000000 1 1 1 1
3 0.906238 0.936931 0.973564 0.98604 0.993366 1 1 1 1
4 1.000000 1.000000 1.000000 1.00000 1.000000 1 1 1 1
[5 rows x 101 columns]
Finally, to select specific subsets of the x and y data according to coding variables stored in data, I just take the index of the relevant subset of data
subject1_index = data[data.subject==1].index
print subject1_index
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39], dtype='int64')
and select a matching subset of nx and ny using the iloc method.
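As an aside, because data, nx, and ny share the same 0-based index after reindexing, the same subset can also be picked out with a boolean mask rather than going through iloc. A sketch on made-up toy frames:

```python
import pandas as pd

# Toy stand-ins with the shared 0-based index assumed above
data = pd.DataFrame({'subject': [1, 1, 2]})
nx = pd.DataFrame([[0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])

subject1_index = data[data.subject == 1].index
sub1_x = nx.iloc[subject1_index]             # positional, as in the answer
sub1_x_alt = nx[(data.subject == 1).values]  # boolean mask, same rows
```

The two selections give identical results as long as the frames stay aligned; the mask version saves building the intermediate index object.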
sub1_x = nx.iloc[subject1_index]
sub1_y = ny.iloc[subject1_index]
for i in subject1_index:
    plt.plot(nx.iloc[i], ny.iloc[i], 'r', alpha=.3)
plt.plot(sub1_x.mean(), sub1_y.mean(), 'r', linewidth=2)
EDIT: For completeness, note that a lot of my analysis requires long format data (and is carried out in R). Again, there may be a more elegant way of doing this (so use at your own risk!), but my code goes as follows (note, this is real code from a different dataset, and I haven't bothered to change the variable names to match the original example):
# Long format data
wide_data = data.copy()
steps = nx.columns
for i in steps:
    wide_data['nx_%i' % i] = nx[i]
    wide_data['ny_%i' % i] = ny[i]
id_vars = ['subject_nr', 'condition', 'count_trial_sequence',
'trial_id', 'choice', 'accuracy']
# Long data with 'nx' as the variable
long_data = pd.melt(wide_data, id_vars=id_vars, value_vars = ['nx_%i' % i for i in steps])
long_data['step'] = long_data.variable.map(lambda s: int(s[3:]))
long_data['nx'] = long_data.value
# Same with 'ny'
tmp_long = pd.melt(wide_data, id_vars=id_vars, value_vars = ['ny_%i' % i for i in steps])
# Combine in single data frame
long_data['ny'] = tmp_long['value']
del tmp_long
long_data = long_data.drop(['variable', 'value'], axis=1)
long_data.to_csv(os.path.join('data', 'long_data.csv'))
long_data.head()
Out[41]:
subject_nr condition count_trial_sequence trial_id choice accuracy
0 505250022 A 0 13 rsp1 True
1 505250022 A 1 16 rsp1 True
2 505250022 B 2 2 rsp2 False
3 505250022 B 3 0 rsp1 False
4 505250022 C 4 33 rsp2 False
step nx ny
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
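On more recent pandas versions the same long format can be built a little more directly by stacking nx and ny and merging the result back onto data, avoiding the intermediate thousands-column wide frame. A sketch under the assumption that the three frames share the same 0-based row index (the toy data and column names here are illustrative):

```python
import pandas as pd

# Toy stand-ins; assumes data, nx and ny share the same 0-based row index
data = pd.DataFrame({'subject': [1, 2], 'resp': [2, 1]})
nx = pd.DataFrame([[0.0, 0.5, 1.0], [0.0, 0.4, 0.9]])
ny = pd.DataFrame([[0.0, 0.2, 0.8], [0.0, 0.1, 1.0]])

# Stack each wide frame into a (row, step) -> value Series, then combine
long_data = pd.concat({'nx': nx.stack(), 'ny': ny.stack()}, axis=1)
long_data.index.names = ['row', 'step']
long_data = (long_data.reset_index()
             .merge(data, left_on='row', right_index=True)
             .drop(columns='row'))
```

This yields one row per (trial, step) pair with the coding variables repeated alongside, which is the shape R's long-format tools expect.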