Multiple Spelling Results in a Dataframe 1

Published: 2020/6/11 19:29:13. Tags: python, dataframe, difflib, spelling, fuzzywuzzy

Problem description

I have some data containing spelling errors. I'm correcting them and scoring how close each spelling is, using the following code:

 import pandas as pd
 import difflib

 Li_A = ["potato", "tomato", "squash", "apple", "pear"]

 Q    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
         'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

 df_Q = pd.DataFrame(Q)

 # Define the function that Corrects & Scores the Spelling
 def Spelling(ask):
     a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)

     # List comprehension for all values of a
     b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
     return pd.Series(a + b)

 # Apply the function that Corrects & Scores the Spelling
 df_A = df_Q['one'].apply(Spelling)

 # Get the column names on the A dataframe
 c = len(df_A.columns) // 2
 df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                ['Score_{}'.format(y)    for y in range(c)]

 # Join the Q & A dataframes
 df_QA = df_Q.join(df_A)

The results are as follows:

 df_QA
       one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
 a  potat0  po1ato     potato     tomato       pear      apple     squash   
 b  toma3o  2omato     tomato     potato       pear      apple     squash   
 c  s5uash  squ0sh     squash       pear      apple     tomato     potato   
 d   ap8le   2pple      apple       pear     tomato     squash     potato   
 e    pea7    p3ar       pear     potato      apple     tomato     squash   

     Score_0   Score_1   Score_2   Score_3   Score_4  
 a  0.833333  0.500000  0.400000  0.181818  0.166667  
 b  0.833333  0.333333  0.200000  0.181818  0.166667  
 c  0.833333  0.200000  0.181818  0.166667  0.166667  
 d  0.800000  0.222222  0.181818  0.181818  0.181818  
 e  0.750000  0.400000  0.444444  0.200000  0.200000

For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score than "potato" (0.444444 versus 0.400000). This is the wrong way round for my application.

How do I get the higher-scoring results to appear consistently to the left, please?
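
One way to keep the higher scores on the left is to score every candidate directly and sort on that score, rather than relying on the order returned by `get_close_matches`. The sketch below is only an illustration under that assumption: `rank_matches` is a hypothetical helper name, not part of difflib, and it reuses the `n=5` and `cutoff=0.1` values from the code above.

```python
import difflib

# Candidate list copied from the question
Li_A = ["potato", "tomato", "squash", "apple", "pear"]

def rank_matches(ask, candidates, n=5, cutoff=0.1):
    """Hypothetical helper: return (words, scores), highest score first."""
    scored = []
    for word in candidates:
        # Query first, candidate second: the same argument order as the
        # scoring step in the question's Spelling() function
        score = difflib.SequenceMatcher(None, ask, word).ratio()
        if score >= cutoff:
            scored.append((score, word))
    # Sort on the score only; the sort is stable, so ties keep list order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:n]
    return [word for _, word in top], [score for score, _ in top]

words, scores = rank_matches("pea7", Li_A)
print(words)
print(scores)
```

Inside `Spelling`, the two lists could then be returned as `pd.Series(words + scores)`, so the column renaming and the join would stay as they are.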

Edit 1: I tried some simpler code:

 import difflib
 Li_A = ["potato", "tomato", "squash", "apple", "pear"]
 Q    = "pea7"
 A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

and got the same result:

 A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried some simpler scoring code:

 import difflib
 S1 = difflib.SequenceMatcher(None, "pea7", "potato")
 R1 = S1.ratio()
 S2 = difflib.SequenceMatcher(None, "pea7", "apple")
 R2 = S2.ratio()

and again I got the same result:

 R1: 0.4
 R2: 0.444
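
A possible explanation for the mismatch, offered as an assumption rather than a confirmed diagnosis: `SequenceMatcher.ratio()` can give different values when its two sequences are swapped, and CPython's `get_close_matches` appears to score with the candidate as the first sequence and the query as the second, which is the reverse of the order used in the scoring code above. The sketch below simply compares both orders for the two words in question; the variable names are illustrative.

```python
import difflib

query = "pea7"
ratios = {}
for word in ["potato", "apple"]:
    # Order used in the question's scoring code: query first
    query_first = difflib.SequenceMatcher(None, query, word).ratio()
    # Reversed order: candidate first, query second
    candidate_first = difflib.SequenceMatcher(None, word, query).ratio()
    ratios[word] = (query_first, candidate_first)
    print(word, round(query_first, 3), round(candidate_first, 3))
```

If the two orders disagree for a word, ranking and scoring with the same argument order should make the columns consistent.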

Edit 2: I tried it with fuzzywuzzy. I got the same result again, since fuzzywuzzy depends on difflib:

 from fuzzywuzzy import fuzz
 R1 = fuzz.ratio("pea7", "potato")
 R2 = fuzz.ratio("pea7", "apple")
