df.duplicated() false positives?
Question
I have a dataframe that holds 2,865,044 entries with a 3-level MultiIndex
MultiIndex.levels.names = ['year', 'country', 'productcode']
I am trying to reshape the dataframe to produce a wide dataframe, but I am getting the error:
ReshapeError: Index contains duplicate entries, cannot reshape
I used:
data[data.duplicated()]
to identify the lines causing the error, but the data that it lists doesn't seem to contain any duplicates.
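One likely explanation (sketched here with hypothetical toy data, not the asker's actual frame): DataFrame.duplicated() compares only the column values, not the index, so rows that repeat an index entry but carry different values are never flagged:

```python
import pandas as pd

# Hypothetical toy frame with a 3-level MultiIndex like the one described
idx = pd.MultiIndex.from_tuples(
    [(1962, 'MYS', '711'), (1962, 'MYS', '711'), (1962, 'MYS', '712')],
    names=['year', 'country', 'productcode'])
data = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx)

# duplicated() looks only at the columns; since the 'value' column has no
# repeats, the repeated (1962, 'MYS', '711') index entry is not reported:
print(data[data.duplicated()])  # empty frame

# But the index duplication that breaks the reshape is still there:
print(data.index.has_duplicates)  # True
```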
This led me to export my dataframe with to_csv(), open the data in Stata, and run the duplicates list command, which found that the dataset doesn't hold duplicates (according to Stata).
An example from the sorted csv file:
year country productcode duplicate
1962 MYS 711 FALSE
1962 MYS 712 TRUE
1962 MYS 721 FALSE
I know it's a long shot, but any ideas what might be causing this? The data types in each index column are ['year': int; 'country': str; 'productcode': str]. Could it be how pandas defines the unique groups? Any better ways to list the offending index lines?
Update: I have tried resetting the index
temp = data.reset_index()
dup = temp[temp.duplicated(cols=['year', 'country', 'productcode'])]
and I get a completely different list!
year country productcode
1994 HKG 9710
1994 USA 9710
1995 HKG 9710
1995 USA 9710
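For reference, in current pandas the cols= keyword of duplicated() has been renamed subset=, and passing keep=False lists every member of each duplicate group rather than only the later repeats (again sketched on hypothetical toy data):

```python
import pandas as pd

# Hypothetical frame containing one genuinely duplicated index entry
idx = pd.MultiIndex.from_tuples(
    [(1994, 'HKG', '9710'), (1994, 'HKG', '9710'), (1995, 'USA', '9710')],
    names=['year', 'country', 'productcode'])
data = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx)

temp = data.reset_index()
# keep=False marks all rows of each duplicated group, not just the repeats,
# so every offending line shows up in the listing:
dup = temp[temp.duplicated(subset=['year', 'country', 'productcode'],
                           keep=False)]
print(dup)
```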
Update 2 [28 June 2013]:
It appears to have been a strange memory issue during my IPython session. This morning's fresh instance seems to work fine and reshapes the data without any adjustments to yesterday's code! I will debug further if the issue returns and let you know. Anyone know of a good debugger for IPython sessions?
Answer
Perhaps try
cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)
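The two lines above can be exercised end to end like this (toy data of my own, assuming the question's 3-level index; drop_duplicates keeps the first row of each duplicate group):

```python
import pandas as pd

# Hypothetical frame with one duplicated index entry
idx = pd.MultiIndex.from_tuples(
    [(1994, 'HKG', '9710'), (1994, 'HKG', '9710'), (1995, 'USA', '9710')],
    names=['year', 'country', 'productcode'])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx)

# Move the index into columns, drop rows with identical index values
# (keeping the first), then restore the MultiIndex:
cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)

print(cleaned.index.is_unique)  # safe to reshape now
```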
I think there ought to be a duplicated method on the index; there is not yet:
https://github.com/pydata/pandas/issues/4060
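(For later readers: recent pandas versions do expose duplicated() directly on an Index, including a MultiIndex, so the index can now be checked without a reset_index round trip. A minimal sketch:)

```python
import pandas as pd

# Index.duplicated() flags repeats of an index entry directly
idx = pd.MultiIndex.from_tuples(
    [(1994, 'HKG', '9710'), (1994, 'HKG', '9710')],
    names=['year', 'country', 'productcode'])
print(idx.duplicated())  # [False  True]
```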