df.duplicated() false positives?

Problem description

I have a dataframe that holds 2,865,044 entries with a 3-level MultiIndex:

MultiIndex.levels.names = ['year', 'country', 'productcode']

I am trying to reshape the dataframe to produce a wide dataframe, but I am getting the error:

ReshapeError: Index contains duplicate entries, cannot reshape

I have used:

data[data.duplicated()]

to identify the lines causing the error, but the data that it lists doesn't seem to contain any duplicates.

This led me to export my dataframe with to_csv(), open the data in Stata, and run the duplicates list command, which reports that the dataset doesn't hold any duplicates (according to Stata).

An example from the sorted CSV file:

year country productcode duplicate
1962    MYS     711       FALSE
1962    MYS     712       TRUE
1962    MYS     721       FALSE

I know it's a long shot, but any ideas what might be causing this? The data types of the index columns are ['year': int, 'country': str, 'productcode': str]. Could it be how pandas defines the unique groups? Are there any better ways to list the offending index lines?

Update: I have tried resetting the index

temp = data.reset_index()
# `cols` was the keyword at the time; later pandas versions call it `subset`
dup = temp[temp.duplicated(subset=['year', 'country', 'productcode'])]

and I get a completely different list!

year    country productcode
1994      HKG      9710
1994      USA      9710
1995      HKG      9710
1995      USA      9710

Update 2 [28 June 2013]:

It appears to have been a strange memory issue during my IPython session. This morning's fresh instance seems to work fine and reshapes the data without any adjustments to yesterday's code! I will debug further if the issue returns and let you know. Does anyone know of a good debugger for IPython sessions?

Recommended answer

Maybe try

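# Drop rows whose index key repeats (keeping the first), then restore the MultiIndex
cleaned = df.reset_index().drop_duplicates(subset=list(df.index.names))
cleaned.set_index(list(df.index.names), inplace=True)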
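As a minimal sketch of this approach on made-up data (the toy frame, its `value` column, and the repeated key are assumptions for illustration, not the asker's dataset):

```python
import pandas as pd

# Hypothetical toy data: the key (1962, MYS, 712) appears twice.
df = pd.DataFrame(
    {
        "year": [1962, 1962, 1962, 1962],
        "country": ["MYS", "MYS", "MYS", "MYS"],
        "productcode": ["711", "712", "712", "721"],
        "value": [10.0, 20.0, 25.0, 30.0],
    }
).set_index(["year", "country", "productcode"])

keys = list(df.index.names)

# Keep the first row for each index key, then restore the MultiIndex.
cleaned = df.reset_index().drop_duplicates(subset=keys).set_index(keys)

# With a unique index, the long-to-wide reshape no longer raises.
wide = cleaned["value"].unstack("productcode")
print(wide)
```

Note that keeping the first row silently discards the second value for the repeated key (25.0 here), so it is worth inspecting the duplicated rows before dropping them.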

I think there ought to be a duplicated method on the index; there is not one yet:

https://github.com/pydata/pandas/issues/4060
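Later pandas releases do provide `Index.duplicated`, which checks the index itself. `DataFrame.duplicated()` compares column values only and ignores the index, which may be why its output did not match the keys that break the reshape. A minimal sketch on made-up data (the frame and its `value` column are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: key (1994, HKG, 9710) repeats with different values,
# while two distinct keys happen to share the same value.
data = pd.DataFrame(
    {"value": [5.0, 7.0, 7.0, 9.0]},
    index=pd.MultiIndex.from_tuples(
        [
            (1994, "HKG", "9710"),
            (1994, "HKG", "9710"),
            (1994, "USA", "9710"),
            (1995, "HKG", "9710"),
        ],
        names=["year", "country", "productcode"],
    ),
)

# Compares column values only: flags the third row (value 7.0 repeats),
# even though its index key is unique.
by_values = data.duplicated()

# Compares index keys: keep=False marks every row of a repeated key.
by_index = data.index.duplicated(keep=False)

# Rows whose index key occurs more than once.
offending = data[by_index]
print(offending)
```

Passing `keep=False` lists all rows that share a key rather than only the second and later occurrences, which makes it easier to see what differs between them.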
