How to create a lagged data structure using pandas dataframe
Question
Example:
import pandas as pd
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)
1 5
2 4
3 3
4 2
5 1
Is there an efficient way to create a Series that contains, in each row, the lagged values (in this example, up to lag 2)?
3 [3, 4, 5]
4 [2, 3, 4]
5 [1, 2, 3]
This corresponds to s = pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done efficiently for DataFrames that contain many very long time series?
Thanks
Edited after seeing the answers:
OK, in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if isinstance(s, pd.DataFrame):
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif isinstance(s, pd.Series):
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired output and manages the naming of columns in the resulting DataFrame.
For a Series as input:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
   lag_0  lag_1  lag_2
1      5    NaN    NaN
2      4      5    NaN
3      3      4      5
4      2      3      4
5      1      2      3
And for a DataFrame as input:
s2=pd.DataFrame({'a':[5,4,3,2,1], 'b':[50,40,30,20,10]},index=[1,2,3,4,5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)
   a  a_lag1  a_lag2   b  b_lag1  b_lag2
3  3       4       5  30      40      50
4  2       3       4  20      30      40
5  1       2       3  10      20      30
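For the DataFrame case, a shorter variant is possible, since shift() works on a whole DataFrame at once. The following is only a sketch, not the function above: the names df, parts and lagged are made up for illustration, and the column order differs (all original columns first, then all lag-1 columns, then all lag-2 columns).

```python
import pandas as pd

# Same toy data as in the question's DataFrame example.
df = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]},
                  index=[1, 2, 3, 4, 5])
lag = 2

# df.shift(k) shifts every column by k rows; add_suffix renames the
# shifted columns to e.g. 'a_lag1' so they do not clash with the originals.
parts = [df] + [df.shift(k).add_suffix('_lag%d' % k) for k in range(1, lag + 1)]

# Glue the pieces side by side and drop the rows that lack a full history.
lagged = pd.concat(parts, axis=1).dropna()
print(lagged)
```

Note that the lagged columns come back as floats, because shifting introduces NaN before the incomplete rows are dropped.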
Recommended answer
As mentioned, it could be worth looking into the rolling_ functions (exposed as the .rolling() method in newer pandas versions), which would mean you don't have as many copies around.
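As a rough illustration of the rolling approach, assuming a recent pandas where windowed operations are available through the Series.rolling() method: if the lagged values are only needed to compute a statistic over each window, the window can be evaluated directly without building the lagged columns. The window size of 3 (current value plus two lags) and the choice of mean are just examples.

```python
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])

# A window of 3 covers the current value plus lag 1 and lag 2.
# The first two rows have no full window, so they come out as NaN.
rolling_mean = s.rolling(window=3).mean()
print(rolling_mean)
```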
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
   0    1    2
1  5  NaN  NaN
2  4    5  NaN
3  3    4    5
4  2    3    4
5  1    2    3
In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
   0  1  2
3  3  4  5
4  2  3  4
5  1  2  3
Working with this will be more efficient than working with lists...
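For very long series, one possible alternative (not part of the answer above) is to build the lag matrix from a strided view of the underlying array. This is a sketch that assumes NumPy 1.20 or later, where numpy.lib.stride_tricks.sliding_window_view is available; the names windows and lagged are illustrative.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lag = 2

# Each row of `windows` is [s[t-lag], ..., s[t]]; it is a read-only
# view onto the original values rather than a set of shifted copies.
windows = sliding_window_view(s.to_numpy(), lag + 1)

# Reverse the columns so that column k holds lag k, matching the
# layout produced by concatenating s, s.shift(1), s.shift(2).
lagged = pd.DataFrame(
    windows[:, ::-1],
    index=s.index[lag:],
    columns=['lag_%d' % k for k in range(lag + 1)],
)
print(lagged)
```

Rows without a full history are never created here, so no dropna() is needed and the integer dtype is preserved.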