Removing *NEARLY* Duplicate Observations - Python
Question
I am attempting to remove some observations in a data frame where the similarities are ALMOST 100% but not quite. See frame below:
Notice how "John", "Mary", and "Wesley" have nearly identical observations, with only one column differing. The real data set has 15 columns and 215,000+ observations. In every case I could verify visually, the pattern was the same: out of 15 columns, the other observation matched on 14, every time. For the purposes of the project I have decided to remove the repeated observations (and store them in another data frame, just in case my boss asks to see them).
I have obviously thought of drop_duplicates(keep='something'), but that would not work since the observations are not ENTIRELY identical. Has anyone ever encountered such an issue? Any idea on a remedy?
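As a quick check of why exact deduplication fails here (using a two-row frame modeled on the sample data in the answer below, which is an assumption about the shape of the original 15-column frame), drop_duplicates() keeps both rows because they do not agree on every column:

```python
import pandas as pd

# Two near-duplicate rows: only the Salary column differs
# (hypothetical sample; the real frame is not reproduced here).
df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['John', 45, 105500, 'DC'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

# Exact deduplication removes nothing, since no row matches on ALL columns.
print(len(df.drop_duplicates()))  # 2
```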
Answer
What about a simple loop over subsets of the columns:
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

# Every column except 'Name' is a candidate for the single mismatch.
cols = df.columns.tolist()
cols.remove('Name')

for col in cols:
    # Deduplicate on all columns except the one currently ignored,
    # so rows matching everywhere but in `col` are treated as duplicates.
    observed_cols = df.drop(col, axis=1).columns.tolist()
    df.drop_duplicates(observed_cols, keep='first', inplace=True)

print(df)
Returns:
Name Age Salary City
0 John 45 85000 DC
1 Netcha 25 48000 NYC
2 Mary 45 85000 DC
3 Wesley 36 72500 LA
4 Porter 22 98750 Seattle
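To also keep the dropped near-duplicates in a second data frame (as the question mentions, for the boss), one option is to snapshot the frame before the loop and compare indices afterwards. This is a sketch building on the answer above, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'])

original = df.copy()  # snapshot before deduplication

cols = [c for c in df.columns if c != 'Name']
for col in cols:
    # Same logic as the answer: dedupe on all columns except `col`.
    observed_cols = [c for c in df.columns if c != col]
    df = df.drop_duplicates(observed_cols, keep='first')

# Rows whose index no longer appears in df were dropped as near-duplicates.
removed = original.loc[original.index.difference(df.index)]
print(removed)  # the John/Mary/Wesley near-duplicate rows
```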