df.duplicated() false positives?
Question
I have a dataframe that holds 2,865,044 entries with a 3-level MultiIndex
MultiIndex.levels.names = ['year', 'country', 'productcode']
I am trying to reshape the dataframe to produce a wide dataframe, but I am getting the error:
ReshapeError: Index contains duplicate entries, cannot reshape
I used:
data[data.duplicated()]
to identify the lines causing the error, but the data that it lists doesn't seem to contain any duplicates.
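One likely explanation (sketched here with hypothetical toy data, not the asker's actual frame): DataFrame.duplicated() compares only the column values, not the index, so rows that repeat an index entry but carry different values are never flagged:

```python
import pandas as pd

# Hypothetical toy frame with a 3-level MultiIndex like the one described
idx = pd.MultiIndex.from_tuples(
    [(1962, 'MYS', '711'), (1962, 'MYS', '711'), (1962, 'MYS', '712')],
    names=['year', 'country', 'productcode'])
data = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx)

# duplicated() looks only at the columns; since the 'value' column has no
# repeats, the repeated (1962, 'MYS', '711') index entry is not reported:
print(data[data.duplicated()])  # empty frame

# But the index duplication that breaks the reshape is still there:
print(data.index.has_duplicates)  # True
```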
This led me to export my dataframe with to_csv(), open the data in Stata, and run the duplicates list command, which found that the dataset doesn't hold duplicates (according to Stata).
An example from the sorted csv file:
year country productcode duplicate
1962 MYS 711 FALSE
1962 MYS 712 TRUE
1962 MYS 721 FALSE
I know it's a long shot, but any ideas what might be causing this? The data types in each index column are ['year': int; 'country': str; 'productcode': str]. Could it be how pandas defines the unique groups? Any better ways to list the offending index lines?
Update: I have tried resetting the index
temp = data.reset_index()
dup = temp[temp.duplicated(cols=['year', 'country', 'productcode'])]
and I get a completely different list!
year country productcode
1994 HKG 9710
1994 USA 9710
1995 HKG 9710
1995 USA 9710
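For reference, in current pandas the cols= keyword of duplicated() has been renamed subset=, and passing keep=False lists every member of each duplicate group rather than only the later repeats (again sketched on hypothetical toy data):

```python
import pandas as pd

# Hypothetical frame containing one genuinely duplicated index entry
idx = pd.MultiIndex.from_tuples(
    [(1994, 'HKG', '9710'), (1994, 'HKG', '9710'), (1995, 'USA', '9710')],
    names=['year', 'country', 'productcode'])
data = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx)

temp = data.reset_index()
# keep=False marks all rows of each duplicated group, not just the repeats,
# so every offending line shows up in the listing:
dup = temp[temp.duplicated(subset=['year', 'country', 'productcode'],
                           keep=False)]
print(dup)
```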
Update 2 [28 June 2013]:
It appears to have been a strange memory issue during my IPython session. This morning's fresh instance seems to work fine and reshapes the data without any adjustments to yesterday's code! I will debug further if the issue returns and let you know. Anyone know of a good debugger for IPython sessions?
Answer
Perhaps try
cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)
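The two lines above can be exercised end to end like this (toy data of my own, assuming the question's 3-level index; drop_duplicates keeps the first row of each duplicate group):

```python
import pandas as pd

# Hypothetical frame with one duplicated index entry
idx = pd.MultiIndex.from_tuples(
    [(1994, 'HKG', '9710'), (1994, 'HKG', '9710'), (1995, 'USA', '9710')],
    names=['year', 'country', 'productcode'])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0]}, index=idx)

# Move the index into columns, drop rows with identical index values
# (keeping the first), then restore the MultiIndex:
cleaned = df.reset_index().drop_duplicates(df.index.names)
cleaned.set_index(df.index.names, inplace=True)

print(cleaned.index.is_unique)  # safe to reshape now
```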
I think there ought to be a duplicated method on the index; there is not yet:
https://github.com/pydata/pandas/issues/4060
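(For later readers: recent pandas versions do expose duplicated() directly on an Index, including a MultiIndex, so the index can now be checked without a reset_index round trip. A minimal sketch:)

```python
import pandas as pd

# Index.duplicated() flags repeats of an index entry directly
idx = pd.MultiIndex.from_tuples(
    [(1994, 'HKG', '9710'), (1994, 'HKG', '9710')],
    names=['year', 'country', 'productcode'])
print(idx.duplicated())  # [False  True]
```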