Problems with reindexing dataframes: Reindexing only valid with uniquely valued Index objects
Problem description
I am seeing really strange behaviour when trying to reindex a dataframe in pandas. My pandas version is 0.10.0 and I use Python 2.7. Basically, when I load a dataframe:
eurusd = pd.DataFrame.load('EUR_USD_30Min.df').drop_duplicates().dropna()
eurusd
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 119710 entries, 2003-02-02 17:30:00 to 2012-12-28 17:00:00
Data columns:
open 119710 non-null values
high 119710 non-null values
low 119710 non-null values
close 119710 non-null values
dtypes: float64(4)
and then I try to reindex inside a larger date range:
newindex = pd.DateRange(datetime.datetime(2002,1,1), datetime.datetime(2012,12,31), offset=pd.datetools.Minute(30))
newindex
<class 'pandas.tseries.index.DatetimeIndex'>
[2002-01-01 00:00:00, ..., 2012-12-31 00:00:00]
Length: 192817, Freq: 30T, Timezone: None
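For readers on a newer pandas, where `pd.DateRange` and `pd.datetools` are no longer available, here is a minimal sketch of the same 30-minute index built with `pd.date_range`. It assumes a recent pandas release and is not part of the original question.

```python
import pandas as pd

# Same span and step as the question's DateRange call, using the
# current pd.date_range API (assumption: a recent pandas version).
newindex = pd.date_range(start="2002-01-01", end="2012-12-31", freq="30min")

print(len(newindex))       # number of 30-minute slots in the range
print(newindex.is_unique)  # the target index itself has no duplicates
```

The target index is unique by construction, so any uniqueness error raised by `reindex` has to come from the index of the frame being reindexed.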
Reindexing the dataframe then behaves strangely. If I reindex a two-row slice of the dataset, I get this error:
eurusd[29558:29560].reindex(index=newindex)
Exception: Reindexing only valid with uniquely valued Index objects
But if I do the same for each of the two one-row subsets of that slice, I don't get the error.
Here's the first subset, with no problems:
eurusd[29558:29559].reindex(index=newindex)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 192817 entries, 2002-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: 30T
Data columns:
open 1 non-null values
high 1 non-null values
low 1 non-null values
close 1 non-null values
dtypes: float64(4)
And here's the second subset, still with no problems:
eurusd[29559:29560].reindex(index=newindex)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 192817 entries, 2002-01-01 00:00:00 to 2012-12-31 00:00:00
Freq: 30T
Data columns:
open 1 non-null values
high 1 non-null values
low 1 non-null values
close 1 non-null values
dtypes: float64(4)
This is driving me crazy, and I cannot work out the reason for it. The dataframe seems to be 'clean' of duplicate rows and duplicated index values. I can provide the pickle file for the dataframe if you want.
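One way to check whether a frame really is free of duplicated index values is sketched below. The `toy` frame and its values are hypothetical stand-ins for `eurusd`, chosen so that two rows share a timestamp while holding different prices; the sketch assumes a recent pandas.

```python
import pandas as pd

# Hypothetical toy data: the last two rows share a timestamp
# but carry different values.
idx = pd.to_datetime(["2003-02-02 17:30", "2003-02-02 18:00", "2003-02-02 18:00"])
toy = pd.DataFrame({"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]}, index=idx)

# drop_duplicates() compares column values only, so no row is dropped here.
deduped = toy.drop_duplicates()

# Index.is_unique and Index.duplicated() look at the labels instead.
print(deduped.index.is_unique)
dupes = deduped[deduped.index.duplicated(keep=False)]
print(dupes)
```

Running the same two checks on the real frame would show which timestamps, if any, appear more than once.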
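drop_duplicates() only removes rows whose column values repeat; it does not look at the index, so two rows with different values can still share the same timestamp. That is presumably what happens at positions 29558 and 29559: each one-row slice trivially has a unique index, while the two-row slice does not. You could group by the index and take the first entry (see docs):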
df.groupby(level=0).first()
Example:
In [1]: df = pd.DataFrame([[1], [2]], index=[1, 1])

In [2]: df
Out[2]:
   0
1  1
1  2

In [3]: df.groupby(level=0).first()
Out[3]:
   0
1  1
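As an alternative to the groupby approach, the duplicated labels can be masked out directly before reindexing. This is a minimal sketch assuming a recent pandas; the `toy` frame and `target` range are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data with one repeated timestamp.
idx = pd.to_datetime(["2003-02-02 17:30", "2003-02-02 18:00", "2003-02-02 18:00"])
toy = pd.DataFrame({"open": [1.0, 2.0, 3.0]}, index=idx)

# Hypothetical target range, wider than the data.
target = pd.date_range("2003-02-02 17:00", "2003-02-02 19:00", freq="30min")

# Keep the first row for every index label, then reindex onto the target.
unique = toy[~toy.index.duplicated(keep="first")]
result = unique.reindex(target)
print(result)
```

The two approaches differ slightly: `groupby(level=0).first()` takes the first non-null value per column within each group, whereas the `duplicated` mask keeps the first row as a whole, so they can disagree when the duplicated rows contain missing values.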