Python/Pandas-ValueError:索引包含重复的条目,无法重塑 [英] Python/Pandas - ValueError: Index contains duplicate entries, cannot reshape
问题描述
我有一个称为"bal"的数据框.看起来像这样:
I have a dataframe called 'bal'. It looks like this:
ano id unit period
business_id
9564 2012 302 sdasd anual
9564 2011 303 sdasd anual
2361 2013 304 sdasd anual
2361 2012 305 sdasd anual
...
我正在运行以下代码:
bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)
我的意图是将其变成这样:
My intention is to turn that into something like this:
ano id unit period 2006 2007 2008 2009 2010 \
business_id
72 2013 774 sdasd anual NaN NaN NaN NaN NaN
72 2012 775 sdasd anual NaN NaN NaN NaN NaN
74 2012 1120 sdasd anual NaN NaN NaN NaN NaN
119 2013 875 sdasd anual NaN NaN NaN NaN NaN
119 2012 876 sdasd anual NaN NaN NaN NaN NaN
...
当我编写该代码时,出现此错误:
When I that code, I get this error:
ValueError: Index contains duplicate entries, cannot reshape
因此,为了避免重复,我添加了一个drop_duplicates行:
So to avoid duplicates, I added a drop_duplicates line:
bal=bal.drop_duplicates()
bal=bal.merge(bal.pivot(columns='ano', values='id'),right_index=True,left_index=True)
运行代码时,我遇到了同样的问题:
When I run the code, voilá, I get the same problem:
ValueError: Index contains duplicate entries, cannot reshape
我是在做错什么还是误解了什么?
Am I doing something wrong or misunderstanding something?
编辑
bal
是我正在使用以下代码从SQL创建的数据框:
bal
is a dataframe I'm creating out of a SQL using the following code:
bal=pd.read_sql('select * from table;',connection).set_index('business_id')[['ano','id','unit','period']]
奇怪的是,如果我限制SQL查询,它将很好地工作:
The weird thing is that if I limit the SQL query it works fine:
bal=pd.read_sql('select * from table limit 1000;',connection).set_index('business_id')[['ano','id','unit','period']]
我认为问题可能与索引有很多重复这一事实有关(如您在上面的示例中所见).但是,如果我print(bal.head(4))
在这个受限的bal中看起来与上面看到的完全相同,并且索引重复.
I thought that the problem could be related to the fact that the index has a lot of duplication (as you can see in that example above). However if I print(bal.head(4))
in this limited bal it looks exactly the same as the one you can see above, with indexes that repeat.
推荐答案
UPDATE2:
qry = "select distinct business_id,ano,id,unit,period from table where period='anual'"
bal=pd.read_sql(qry, connection, index_col=['business_id'])
假设我们得到以下DF(在ano
列中仍然有重复的值):
assume we get the following DF (still with duplicated values in the ano
column):
In [167]: bal
Out[167]:
ano id unit period
business_id
9564 2012 302 sdasd anual
9564 2012 299 sdasd anual
9564 2011 303 sdasd anual
2361 2013 304 sdasd anual
2361 2012 305 sdasd anual
我们可以这样做:
In [169]: bal.join(bal.pivot_table(index=bal.index, columns='ano',
values='id', aggfunc='first'))
Out[169]:
ano id unit period 2011 2012 2013
business_id
2361 2013 304 sdasd anual NaN 305.0 304.0
2361 2012 305 sdasd anual NaN 305.0 304.0
9564 2012 302 sdasd anual 303.0 302.0 NaN
9564 2012 299 sdasd anual 303.0 302.0 NaN
9564 2011 303 sdasd anual 303.0 302.0 NaN
更新:
请考虑以下示例DF:
In [161]: bal
Out[161]:
ano id unit period
business_id
9564 2012 302 sdasd anual
9564 2012 299 sdasd anual # i've intentionally added this row with duplicated `ano`
9564 2011 303 sdasd anual
2361 2013 304 sdasd anual
2361 2012 305 sdasd anual
重现您的错误:
In [162]: bal.pivot(columns='ano', values='id')
...
skipped
...
ValueError: Index contains duplicate entries, cannot reshape
旧答案:
这就是你想要的吗?
In [144]: bal.join(bal.pivot(columns='ano', values='id'))
Out[144]:
ano id unit period 2011 2012 2013
business_id
2361 2013 304 sdasd anual NaN 305.0 304.0
2361 2012 305 sdasd anual NaN 305.0 304.0
9564 2012 302 sdasd anual 303.0 302.0 NaN
9564 2011 303 sdasd anual 303.0 302.0 NaN
这篇关于Python/Pandas-ValueError:索引包含重复的条目,无法重塑的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!