Check for duplicate values in Pandas dataframe column

Views: 89 | Published: 2021/6/13 20:27:07 | Tags: python, pandas, dataframe, duplicates


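Problem description

Is there a way in pandas to check whether a dataframe column has duplicate values without actually dropping rows? I have a function that removes duplicate rows, but I only want it to run if a specific column actually contains duplicates.

Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows, there are duplicates and the code runs.

 if len(df['Student'].unique()) < len(df.index):
     # Code to remove duplicates based on Date column runs

Is there an easier or more efficient way to check whether duplicate values exist in a specific column, using pandas?

Some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):

    Student Date
0   Joe     December 2017
1   James   January 2018
2   Bob     April 2018
3   Joe     December 2017
4   Jack    February 2018
5   Jack    March 2018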

Solution

Main question

Is there a duplicate value in a column, True/False?

╔═════════╦═══════════════╗
║ Student ║ Date          ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob     ║ April 2018    ║
╠═════════╬═══════════════╣
║ Joe     ║ December 2018 ║
╚═════════╩═══════════════╝

Assuming the dataframe above (df), we can quickly check whether the Student column contains duplicates with either of:

boolean = not df["Student"].is_unique      # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
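As a quick sanity check, the sketch below rebuilds the three-row example frame by hand (the DataFrame constructor call and the variable names by_count, by_is_unique and by_duplicated are illustrative, not part of the original answer) and confirms that the asker's length comparison and the two one-liners agree:

```python
import pandas as pd

# Same three rows as the example table above.
df = pd.DataFrame({
    "Student": ["Joe", "Bob", "Joe"],
    "Date": ["December 2017", "April 2018", "December 2018"],
})

# The asker's original check: fewer unique values than rows.
by_count = len(df["Student"].unique()) < len(df.index)

# The two one-liners from the answer.
by_is_unique = not df["Student"].is_unique
by_duplicated = bool(df["Student"].duplicated().any())

print(by_count, by_is_unique, by_duplicated)  # True True True

# The Date column has no repeated values, so the same check is negative there.
print(df["Date"].is_unique)  # True
```

All three checks answer the same yes/no question; is_unique is simply the shortest to write, while duplicated() also gives you the row-level mask if you need it afterwards.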

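Further reading and references

Above we are using one of the pandas Series methods. The pandas DataFrame has several useful methods, two of which are:

drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.

These methods can be applied to the DataFrame as a whole, and not just to a Series (column) as above. The equivalent would be:

boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.

However, if we are interested in the whole frame, we could go ahead and do:

boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicated rows
# i.e. Joe December 2017, Joe December 2018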

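And a final useful tip. By using the keep parameter we can usually skip a few steps and get directly to what we need:

keep : {'first', 'last', False}, default 'first'

first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.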
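To make the three keep options concrete, here is a small sketch using duplicated() on just the Student values from the example (the Series construction and the variable names are assumptions for illustration). With duplicated(), keep decides which occurrences are marked True rather than which rows are dropped:

```python
import pandas as pd

# Student column from the example frame: Joe appears at positions 0 and 2.
students = pd.Series(["Joe", "Bob", "Joe"], name="Student")

# keep='first': every occurrence after the first is marked as a duplicate.
first = students.duplicated(keep="first").tolist()

# keep='last': every occurrence before the last is marked as a duplicate.
last = students.duplicated(keep="last").tolist()

# keep=False: all members of a duplicated group are marked.
all_marked = students.duplicated(keep=False).tolist()

print(first)       # [False, False, True]
print(last)        # [True, False, False]
print(all_marked)  # [True, False, True]
```

keep=False is handy when you want to inspect every row involved in a duplication before deciding which one to keep.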

Example to play around with

import pandas as pd
import io

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''

df = pd.read_csv(io.StringIO(data), sep=',')

# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True

# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')

# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)

Returns

True

  Student           Date
0     Joe  December 2017
1     Bob     April 2018

  Student           Date
0     Joe  December 2017
1     Bob     April 2018
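The question also mentions keeping the row with the oldest date when duplicates are found. The thread does not show that function, so the following is only a sketch under stated assumptions: the helper name drop_later_duplicates is hypothetical, the dates are assumed to be "Month Year" strings with English month names as in the sample data, and the data is the six-row sample from the question.

```python
import pandas as pd

# Sample data from the question (only the two columns shown there).
df = pd.DataFrame({
    "Student": ["Joe", "James", "Bob", "Joe", "Jack", "Jack"],
    "Date": ["December 2017", "January 2018", "April 2018",
             "December 2017", "February 2018", "March 2018"],
})


def drop_later_duplicates(frame, key="Student", date_col="Date"):
    """Hypothetical helper: keep the earliest-dated row for each key value."""
    # Cheap check first: skip the de-duplication work when nothing repeats.
    if not frame[key].duplicated().any():
        return frame
    # Assumption: dates look like 'December 2017' (English month names).
    parsed = pd.to_datetime(frame[date_col], format="%B %Y")
    # Stable sort so rows with equal dates keep their original order.
    order = parsed.sort_values(kind="mergesort").index
    deduped = frame.loc[order].drop_duplicates(subset=[key], keep="first")
    # Restore the original row order for readability.
    return deduped.sort_index()


result = drop_later_duplicates(df)
print(result)
```

On the sample data this keeps rows 0, 1, 2 and 4, so Jack's February 2018 row survives and the March 2018 row is dropped.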
