Multiple Spelling Results in a Dataframe 1

Question

I have some data containing spelling errors. I'm correcting them and scoring how close the spelling is using the following code:

 import pandas as pd
 import difflib

 Li_A = ["potato", "tomato", "squash", "apple", "pear"]

 Q    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
         'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

 df_Q = pd.DataFrame(Q)

 # Define the function that Corrects & Scores the Spelling
 def Spelling(ask):
     a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)

     # List comprehension for all values of a
     b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
     return pd.Series(a + b)

 # Apply the function that Corrects & Scores the Spelling
 df_A = df_Q['one'].apply(Spelling)

 # Get the column names on the A dataframe
 c = len(df_A.columns) // 2
 df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                ['Score_{}'.format(y)    for y in range(c)]

 # Join the Q & A dataframes
 df_QA = df_Q.join(df_A)

The result is as follows:

 df_QA
       one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
 a  potat0  po1ato     potato     tomato       pear      apple     squash   
 b  toma3o  2omato     tomato     potato       pear      apple     squash   
 c  s5uash  squ0sh     squash       pear      apple     tomato     potato   
 d   ap8le   2pple      apple       pear     tomato     squash     potato   
 e    pea7    p3ar       pear     potato      apple     tomato     squash   

     Score_0   Score_1   Score_2   Score_3   Score_4  
 a  0.833333  0.500000  0.400000  0.181818  0.166667  
 b  0.833333  0.333333  0.200000  0.181818  0.166667  
 c  0.833333  0.200000  0.181818  0.166667  0.166667  
 d  0.800000  0.222222  0.181818  0.181818  0.181818  
 e  0.750000  0.400000  0.444444  0.200000  0.200000  

For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score (0.444444) than "potato" (0.400000). This is the wrong way round for my application.

How do I get the higher-scoring results to be consistently on the left?

Edit 1: I tried simpler code:

 import difflib
 Li_A = ["potato", "tomato", "squash", "apple", "pear"]
 Q    = "pea7"
 A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

and got the same result:

 A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried simpler scoring code:

 import difflib
 S1 = difflib.SequenceMatcher(None, "pea7", "potato")
 R1 = S1.ratio()
 S2 = difflib.SequenceMatcher(None, "pea7", "apple")
 R2 = S2.ratio()

and again I got the same result:

 R1: 0.4
 R2: 0.444

Edit 2: I tried it with fuzzywuzzy and got the same result again, since fuzzywuzzy depends on difflib:

 from fuzzywuzzy import fuzz
 R1 = fuzz.ratio("pea7", "potato")
 R2 = fuzz.ratio("pea7", "apple")

Recommended answer

SequenceMatcher is correctly calculating the ratio using the method described by Ratcliff and Metzener (1988). That is, given the number of characters found in common (CC) and the total number of characters in the two strings (CT):

ratio = 2 * CC / CT
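
As a quick check of that formula, here is a minimal sketch that counts the matched characters with get_matching_blocks() and compares 2 * CC / CT against ratio() for the two strings from the question. The helper name ratio_by_hand is made up for illustration and is not part of difflib.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute the similarity as 2 * CC / CT (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # CC: total size of all matching blocks (the last block is a size-0 sentinel)
    cc = sum(block.size for block in matcher.get_matching_blocks())
    # CT: combined length of both strings
    ct = len(a) + len(b)
    return 2 * cc / ct if ct else 1.0

for word in ("potato", "apple"):
    by_hand = ratio_by_hand("pea7", word)
    built_in = difflib.SequenceMatcher(None, "pea7", word).ratio()
    print(word, round(by_hand, 3), round(built_in, 3))
# potato 0.4 0.4
# apple 0.444 0.444
```

Both values agree with R1 and R2 in the question, so the scores themselves are consistent with the formula.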

So the scores themselves are correct, and the issue appears to be with the order returned by get_close_matches.
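
One possible workaround, sketched below, is to skip the ordering from get_close_matches and sort the candidates by the same ratio() call that produces the reported scores. A likely cause of the mismatch (an assumption based on reading CPython's difflib source, not something stated in the question) is that get_close_matches scores each candidate with the candidate as the first sequence and the query as the second, while the question's code passes them the other way round, and ratio() can depend on argument order. The function name spelling_sorted is hypothetical.

```python
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

# ratio() can depend on argument order:
print(difflib.SequenceMatcher(None, "pea7", "apple").ratio())  # 0.444...
print(difflib.SequenceMatcher(None, "apple", "pea7").ratio())  # 0.222...

def spelling_sorted(ask, choices=Li_A, n=5, cutoff=0.1):
    """Return (matches, scores), ordered by the score that is reported."""
    # Score every candidate with the query as the first sequence
    scored = [(difflib.SequenceMatcher(None, ask, x).ratio(), x) for x in choices]
    # Keep candidates at or above the cutoff, best score first, at most n
    scored = sorted((s for s in scored if s[0] >= cutoff), reverse=True)[:n]
    matches = [x for _, x in scored]
    scores = [score for score, _ in scored]
    return matches, scores

matches, scores = spelling_sorted("pea7")
print(matches)  # ['pear', 'apple', 'potato', 'tomato', 'squash']
```

In the question's code, Spelling could then return pd.Series(matches + scores), so the Spelling_ and Score_ columns stay in the same descending order.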
