Multiple Spelling Results in a Dataframe 1
Problem description
I have some data containing spelling errors. I'm correcting them and scoring how close each corrected spelling is, using the following code:
import pandas as pd
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = {'one': pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
     'two': pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_Q = pd.DataFrame(Q)

# Define the function that corrects and scores the spelling
def Spelling(ask):
    a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)
    # Score every match in a against the original string
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    return pd.Series(a + b)

# Apply the function that corrects and scores the spelling
df_A = df_Q['one'].apply(Spelling)

# Set the column names on the A dataframe
c = len(df_A.columns) // 2
df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
               ['Score_{}'.format(y) for y in range(c)]

# Join the Q and A dataframes
df_QA = df_Q.join(df_A)
The results are as follows:
df_QA
      one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
a  potat0  po1ato     potato     tomato       pear      apple     squash
b  toma3o  2omato     tomato     potato       pear      apple     squash
c  s5uash  squ0sh     squash       pear      apple     tomato     potato
d   ap8le   2pple      apple       pear     tomato     squash     potato
e    pea7    p3ar       pear     potato      apple     tomato     squash

    Score_0   Score_1   Score_2   Score_3   Score_4
a  0.833333  0.500000  0.400000  0.181818  0.166667
b  0.833333  0.333333  0.200000  0.181818  0.166667
c  0.833333  0.200000  0.181818  0.166667  0.166667
d  0.800000  0.222222  0.181818  0.181818  0.181818
e  0.750000  0.400000  0.444444  0.200000  0.200000
For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score (0.444444) than "potato" (0.400000). This is the wrong way round for my application.
How do I get the higher-scoring results to be consistently to the left, please?
Edit 1: I tried simpler code:
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = "pea7"
A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)
and got the same result:
A: ['pear', 'potato', 'apple', 'tomato', 'squash']
I also tried simpler scoring code:
import difflib
S1 = difflib.SequenceMatcher(None, "pea7", "potato")
R1 = S1.ratio()
S2 = difflib.SequenceMatcher(None, "pea7", "apple")
R2 = S2.ratio()
and again I got the same result:
R1: 0.4
R2: 0.444
Edit 2: I tried it with fuzzywuzzy. I got the same result again, since fuzzywuzzy depends on difflib:
from fuzzywuzzy import fuzz
R1 = fuzz.ratio("pea7", "potato")
R2 = fuzz.ratio("pea7", "apple")
Recommended answer
SequenceMatcher is correctly calculating the ratio using the method described by Ratcliff and Metzener (1988). That is, given the number of characters found in common (CC) and the total number of characters in the two strings (CT):
ratio = 2 * CC / CT
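As a rough check of that formula, the sketch below recomputes the ratio by hand for the two strings from the question. The helper name `ratio_by_hand` is made up for illustration; it assumes CC can be taken as the total size of the blocks returned by `get_matching_blocks()`, and compares the result with what `ratio()` reports.

```python
import difflib

def ratio_by_hand(a, b):
    """Return (2 * CC / CT, SequenceMatcher.ratio()) for strings a and b."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # CC: characters in common, taken here as the summed size of all
    # matching blocks (the last block is a zero-length sentinel)
    cc = sum(block.size for block in matcher.get_matching_blocks())
    # CT: total number of characters in the two strings
    ct = len(a) + len(b)
    return 2 * cc / ct, matcher.ratio()

for candidate in ("potato", "apple"):
    by_hand, from_difflib = ratio_by_hand("pea7", candidate)
    print(candidate, round(by_hand, 3), round(from_difflib, 3))
```

For "pea7" against "potato" this gives CC = 2 and CT = 10, so 0.4; against "apple" it gives CC = 2 and CT = 9, so about 0.444, matching R1 and R2 in the question.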
So it looks like the issue is with get_close_matches.
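One possible explanation, offered as an assumption rather than a confirmed diagnosis: the difflib documentation warns that the result of `ratio()` may depend on the order of the arguments, and CPython's `get_close_matches` appears to score each candidate with the candidate as the first sequence and the query as the second, which is the reverse of the order used in the question's `Spelling` function. The sketch below shows the asymmetry for "pea7" and "apple", then ranks the candidates by the same score that is reported, so the two can never disagree. The helper `spelling_sorted` is a hypothetical name, not part of difflib.

```python
import difflib

word = "pea7"
candidates = ["potato", "tomato", "squash", "apple", "pear"]

# ratio() is not guaranteed to be symmetric in its two arguments
forward = difflib.SequenceMatcher(None, word, "apple").ratio()  # order used in the question
reverse = difflib.SequenceMatcher(None, "apple", word).ratio()  # swapped order
print(forward, reverse)

def spelling_sorted(ask, choices, n=5, cutoff=0.1):
    """Return up to n (match, score) pairs, best score first.

    Ranking and reporting use the same SequenceMatcher(None, ask, choice)
    score, so a higher score always appears further to the left.
    """
    scored = [(difflib.SequenceMatcher(None, ask, c).ratio(), c) for c in choices]
    kept = [(s, c) for s, c in scored if s >= cutoff]
    # Stable sort: candidates with equal scores keep their original order
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [(c, s) for s, c in kept[:n]]

print(spelling_sorted(word, candidates))
```

For "apple" the two argument orders give roughly 0.444 and 0.222, which would account for "potato" (0.4 either way) being ranked ahead of it. A smaller change to the original code would be to keep `get_close_matches` and swap the arguments in the scoring line to `SequenceMatcher(None, x, ask)`, on the assumption that this matches the order used internally.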