Drop Dataframe Rows Based on a Similarity Measure in Pandas


Problem Description

I want to eliminate repeated rows in my dataframe.

I know that the drop_duplicates() method works for dropping rows with identical subcolumn values. However, I want to drop rows that aren't identical but similar. For example, I have the following two rows:

Title                  Area    Price
Apartment at Boston    100     150000
Apt at Boston          105     149000

I want to be able to eliminate one of these two rows based on some similarity measure, such as Title, Area, and Price differing by less than 5%. Say, I could delete rows whose similarity measure is > 0.95. This would be particularly useful for large data sets, instead of manually inspecting row by row. How can I achieve this?
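The criterion described above can be sketched with nothing but the standard library: score the titles with difflib's SequenceMatcher and compare the numeric fields as relative differences. (difflib and the helper name rows_similar are assumptions for illustration; the answer below uses the fuzzywuzzy library instead.)

```python
from difflib import SequenceMatcher

def rows_similar(row_a, row_b, text_threshold=0.8, num_tolerance=0.05):
    """True when the titles are fuzzy-similar and every numeric field
    differs by less than num_tolerance, relative to the larger value."""
    title_score = SequenceMatcher(None, row_a['Title'], row_b['Title']).ratio()
    if title_score < text_threshold:
        return False
    for col in ('Area', 'Price'):
        a, b = row_a[col], row_b[col]
        if abs(a - b) / max(a, b) >= num_tolerance:
            return False
    return True

print(rows_similar(
    {'Title': 'Apartment at Boston', 'Area': 100, 'Price': 150000},
    {'Title': 'Apt at Boston', 'Area': 105, 'Price': 149000}))  # → True
```

On the two example rows the title score is about 0.81 and both numeric fields are within 5%, so the pair counts as a duplicate.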

Recommended Answer

See if this meets your needs:

import pandas as pd

Title = ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago',
         'Apt at Seattle', 'Apt at Seattle', 'Apt at Chicago']
Area = [100, 105, 100, 102, 101, 101]
Price = [150000, 149000, 150200, 150300, 150000, 150000]
data = dict(Title=Title, Area=Area, Price=Price)
df = pd.DataFrame(data, columns=data.keys())

The df created is as below:

   Title                Area    Price
0  Apartment at Boston  100     150000
1  Apt at Boston        105     149000
2  Apt at Chicago       100     150200
3  Apt at Seattle       102     150300
4  Apt at Seattle       101     150000
5  Apt at Chicago       101     150000

Now, we run the code as below:

from fuzzywuzzy import fuzz

def fuzzy_compare(a, b):
    return fuzz.partial_ratio(a, b)

def close_enough(x, y):
    # the ratio of the two values must lie in (0.94, 1.05)
    return 0.94 < x / y < 1.05

tl = df["Title"].tolist()
i = 0
while i < len(tl) - 1:
    itered = i + 1
    while itered < len(tl):
        # only compare pairs where both rows are still in the frame
        if i in df.index and itered in df.index:
            if (fuzzy_compare(tl[i], tl[itered]) > 80
                    and close_enough(df.loc[i, 'Area'], df.loc[itered, 'Area'])
                    and close_enough(df.loc[i, 'Price'], df.loc[itered, 'Price'])):
                df.drop(itered, inplace=True)
        itered += 1
    i += 1

The output is the df below. The repeating Boston and Seattle items are removed when the fuzzy match is more than 80 and the values of Area and Price are within 5% of each other.

   Title                Area    Price
0  Apartment at Boston  100     150000
2  Apt at Chicago       100     150200
3  Apt at Seattle       102     150300
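The same pairwise check can also be written more compactly with itertools.combinations. This sketch swaps fuzzywuzzy for the standard library's difflib (an assumption, so the scores differ slightly from partial_ratio), and the function name dedupe_similar is hypothetical, but it reproduces the same three surviving rows on the sample data:

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

def dedupe_similar(df, text_col, num_cols, text_threshold=0.8, tol=0.05):
    """Drop the later row of every pair whose text is fuzzy-similar and
    whose numeric columns differ by less than tol (relative)."""
    drop = set()
    for i, j in combinations(df.index, 2):
        if i in drop or j in drop:
            continue  # a dropped row cannot eliminate further rows
        score = SequenceMatcher(None, df.at[i, text_col], df.at[j, text_col]).ratio()
        if score < text_threshold:
            continue
        if all(abs(df.at[i, c] - df.at[j, c]) / max(df.at[i, c], df.at[j, c]) < tol
               for c in num_cols):
            drop.add(j)
    return df.drop(index=sorted(drop))

df = pd.DataFrame({
    'Title': ['Apartment at Boston', 'Apt at Boston', 'Apt at Chicago',
              'Apt at Seattle', 'Apt at Seattle', 'Apt at Chicago'],
    'Area': [100, 105, 100, 102, 101, 101],
    'Price': [150000, 149000, 150200, 150300, 150000, 150000],
})
result = dedupe_similar(df, 'Title', ['Area', 'Price'])
print(result)  # keeps rows 0, 2, 3
```

Marking rows in a drop set and dropping once at the end avoids the stale-index problem of dropping inside the loop.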

