选择由DatetimeIndex索引并带有TimeStamps列表的Pandas DataFrame的子集 [英] Selecting a subset of a Pandas DataFrame indexed by DatetimeIndex with a list of TimeStamps
问题描述
我有一只大熊猫DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3425100 entries, 2011-12-01 00:00:00 to 2011-12-31 23:59:59
Data columns:
sig_qual 3425100 non-null values
heave 3425100 non-null values
north 3425099 non-null values
west 3425097 non-null values
dtypes: float64(4)
我使用.ix[start_datetime:end_datetime]
选择该DataFrame
的子集,并将其传递给 peakdetect函数返回两个单独列表中局部最大值和最小值的索引和值.我提取最大值的索引位置,并使用DataFrame.index
获得熊猫TimeStamps的列表.
.ix[]
来提取大型DataFrame的相关子集,但是它似乎总是返回一个空的DataFrame
.我可以遍历TimeStamps的列表并从DataFrame
获取相关行,但这是一个漫长的过程,我认为ix[]
应该接受根据创建了要点,它的确有效,因为当从csv加载数据时,时间戳的索引列存储为对象的numpy数组,看起来像是字符串.与我自己的代码中索引的类型为<class 'pandas.tseries.index.DatetimeIndex'>
并且每个元素的类型为<class 'pandas.lib.Timestamp'>
不同,我认为传递pandas.lib.Timestamp
的列表与传递单个时间戳的效果相同,这是否会被视为错误? /p>
如果我使用索引作为字符串列表创建原始的DataFrame
,则使用字符串列表进行查询可以正常工作.但这确实会大大增加DataFrame的字节大小.
更新3: 该错误仅出现在非常大的DataFrame上,我在不同大小的DataFrame上重新运行了代码(在下面的注释中有一些详细信息),并且它似乎在270万记录以上的DataFrame上发生.使用字符串而不是TimeStamps可以解决此问题,但会增加内存使用量.
已修复 在最新的github管理员(2012年9月18日)中,请参阅页面底部的Wes的评论.
df.ix [my_list_of_dates]应该可以正常工作.
In [193]: df
Out[193]:
A B C D
2012-08-16 2 1 1 7
2012-08-17 6 4 8 6
2012-08-18 8 3 1 1
2012-08-19 7 2 8 9
2012-08-20 6 7 5 8
2012-08-21 1 3 3 3
2012-08-22 8 2 3 8
2012-08-23 7 1 7 4
2012-08-24 2 6 0 6
2012-08-25 4 6 8 1
In [194]: row_pos = [2, 6, 9]
In [195]: df.ix[row_pos]
Out[195]:
A B C D
2012-08-18 8 3 1 1
2012-08-22 8 2 3 8
2012-08-25 4 6 8 1
In [196]: dates = [df.index[i] for i in row_pos]
In [197]: df.ix[dates]
Out[197]:
A B C D
2012-08-18 8 3 1 1
2012-08-22 8 2 3 8
2012-08-25 4 6 8 1
I have a large Pandas DataFrame
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3425100 entries, 2011-12-01 00:00:00 to 2011-12-31 23:59:59
Data columns:
sig_qual 3425100 non-null values
heave 3425100 non-null values
north 3425099 non-null values
west 3425097 non-null values
dtypes: float64(4)
I select a subset of that DataFrame
using .ix[start_datetime:end_datetime]
and I pass this to a peakdetect function which returns the index and value of the local maxima and minima in two seperate lists. I extract the index position of the maxima and using DataFrame.index
I get a list of pandas TimeStamps.
I then attempt to extract the relevant subset of the large DataFrame by passing the list of TimeStamps to .ix[]
but it always seems to return an empty DataFrame
. I can loop over the list of TimeStamps and get the relevant rows from the DataFrame
but this is a lengthy process and I thought that ix[]
should accept a list of values according to the docs?
(Although I see that the example for Pandas 0.7 uses a numpy.ndarray
of numpy.datetime64
)
Update: A small 8 second subset of the DataFrame is selected below, # lines show some of the values:
y = raw_disp['heave'].ix[datetime(2011,12,30,0,0,0):datetime(2011,12,30,0,0,8)]
#csv representation of y time-series
2011-12-30 00:00:00,-310.0
2011-12-30 00:00:01,-238.0
2011-12-30 00:00:01.500000,-114.0
2011-12-30 00:00:02.500000,60.0
2011-12-30 00:00:03,185.0
2011-12-30 00:00:04,259.0
2011-12-30 00:00:04.500000,231.0
2011-12-30 00:00:05.500000,139.0
2011-12-30 00:00:06.500000,55.0
2011-12-30 00:00:07,-49.0
2011-12-30 00:00:08,-144.0
index = y.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-30 00:00:00, ..., 2011-12-30 00:00:08]
Length: 11, Freq: None, Timezone: None
#_max returned from the peakdetect function, one local maxima for this 8 seconds period
_max = [[5, 259.0]]
indexes = [x[0] for x in _max]
#[5]
timestamps = [index[z] for z in indexes]
#[<Timestamp: 2011-12-30 00:00:04>]
print raw_disp.ix[timestamps]
#Empty DataFrame
#Columns: array([sig_qual, heave, north, west, extrema], dtype=object)
#Index: <class 'pandas.tseries.index.DatetimeIndex'>
#Length: 0, Freq: None, Timezone: None
for timestamp in timestamps:
print raw_disp.ix[timestamp]
#sig_qual 0
#heave 259
#north 27
#west 132
#extrema 0
#Name: 2011-12-30 00:00:04
Update 2:
I created a gist, which actually works because when the data is loaded in from csv the index columns of timestamps are stored as numpy array of objects which appear to be strings. Unlike in my own code where the index is of type <class 'pandas.tseries.index.DatetimeIndex'>
and each element is of type <class 'pandas.lib.Timestamp'>
, I thought passing a list of pandas.lib.Timestamp
would work the same as passing individual timestamps, would this be considered a bug?
If I create the original DataFrame
with the index as a list of strings, querying with a list of strings works fine. It does increase the byte size of the DataFrame significantly though.
Update 3: The error only appears to occur with very large DataFrames, I reran the code on varying sizes of DataFrame ( some detail in a comment below ) and it appears to occur on a DataFrame above 2.7 million records. Using strings as opposed to TimeStamps resolves the issue but increases memory usage.
Fixed In latest github master (18/09/2012), see comment from Wes at bottom of page.
df.ix[my_list_of_dates] should work just fine.
In [193]: df
Out[193]:
A B C D
2012-08-16 2 1 1 7
2012-08-17 6 4 8 6
2012-08-18 8 3 1 1
2012-08-19 7 2 8 9
2012-08-20 6 7 5 8
2012-08-21 1 3 3 3
2012-08-22 8 2 3 8
2012-08-23 7 1 7 4
2012-08-24 2 6 0 6
2012-08-25 4 6 8 1
In [194]: row_pos = [2, 6, 9]
In [195]: df.ix[row_pos]
Out[195]:
A B C D
2012-08-18 8 3 1 1
2012-08-22 8 2 3 8
2012-08-25 4 6 8 1
In [196]: dates = [df.index[i] for i in row_pos]
In [197]: df.ix[dates]
Out[197]:
A B C D
2012-08-18 8 3 1 1
2012-08-22 8 2 3 8
2012-08-25 4 6 8 1
这篇关于选择由DatetimeIndex索引并带有TimeStamps列表的Pandas DataFrame的子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!