Distance between strings by similarity of sound


Question

Is there a quantitative descriptor of similarity between two words based on how they sound or are pronounced, analogous to Levenshtein distance?

I know soundex gives the same id to similar-sounding words, but as far as I understand it is not a quantitative descriptor of the difference between the words.

from jellyfish import soundex

print(soundex("two"))
print(soundex("to"))

Recommended answer

You could combine a phonetic encoding with a string comparison algorithm. As a matter of fact, jellyfish supplies both.

Setting up the libraries and example data

from jellyfish import soundex, metaphone, nysiis, match_rating_codex,\
    levenshtein_distance, damerau_levenshtein_distance, hamming_distance,\
    jaro_similarity
from itertools import groupby
import pandas as pd
import numpy as np


dataList = ['two','too','to','fourth','forth','dessert',
            'desert','Byrne','Boern','Smith','Smyth','Catherine','Kathryn']

sounds_encoding_methods = [soundex, metaphone, nysiis, match_rating_codex]

Let's compare the different phonetic encodings

report = pd.DataFrame([dataList]).T
report.columns = ['word']
for i in sounds_encoding_methods:
    print(i.__name__)
    report[i.__name__]= report['word'].apply(lambda x: i(x))
print(report)
          soundex metaphone   nysiis match_rating_codex
word                                                   
two          T000        TW       TW                 TW
too          T000         T        T                  T
to           T000         T        T                  T
fourth       F630       FR0     FART               FRTH
forth        F630       FR0     FART               FRTH
dessert      D263      TSRT    DASAD               DSRT
desert       D263      TSRT    DASAD               DSRT
Byrne        B650       BRN     BYRN               BYRN
Boern        B650       BRN     BARN                BRN
Smith        S530       SM0     SNAT               SMTH
Smyth        S530       SM0     SNYT              SMYTH
Catherine    C365      K0RN  CATARAN              CTHRN
Kathryn      K365      K0RN   CATRYN             KTHRYN

You can see that the phonetic encodings do a pretty good job of making the words comparable: words that sound alike get identical or nearly identical codes. The encodings differ on some cases (for example, only soundex gives "two" and "too" the same code), so you might prefer one or another depending on your data.
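To make the "quantitative" part concrete, here is a minimal sketch that measures the edit distance between two phonetic codes. It assumes nothing beyond the standard library: the `levenshtein` function is a plain hand-written implementation (not jellyfish's), the NYSIIS codes are copied from the table above rather than recomputed, and `phonetic_distance` is a hypothetical helper name used only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# NYSIIS codes copied from the table above (assumption: not recomputed here).
nysiis_codes = {
    'two': 'TW', 'too': 'T',
    'Smith': 'SNAT', 'Smyth': 'SNYT',
    'Catherine': 'CATARAN', 'Kathryn': 'CATRYN',
}


def phonetic_distance(word_a, word_b, codes=nysiis_codes):
    """Hypothetical helper: edit distance between the codes of two words."""
    return levenshtein(codes[word_a], codes[word_b])


for pair in [('two', 'too'), ('Smith', 'Smyth'), ('Catherine', 'Kathryn')]:
    print(pair, phonetic_distance(*pair))
```

A distance of 0 means the two codes are identical; larger values mean the words sound less alike under the chosen encoding.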

Now I will take the codes above and find the closest match for each word using levenshtein_distance, but I could use any of the other comparison functions too.

"""Select the closer by algorithm
for instance levenshtein_distance"""
# Index the encoded table by word so codes can be looked up with .loc
report.set_index('word', inplace=True)
report2 = report.copy()
for sounds_encoding in sounds_encoding_methods:
    name = sounds_encoding.__name__
    report2[name] = None  # object column that will hold the closest word
    for word in dataList:
        closest_list = []
        for word_2 in dataList:
            if word != word_2:
                # Edit distance between the two phonetic codes (lower is closer)
                distance = levenshtein_distance(report.loc[word, name],
                                                report.loc[word_2, name])
                closest_list.append({'word': word_2, 'distance': distance})

        # Keep the candidate with the smallest distance
        report2.loc[word, name] = pd.DataFrame(closest_list).\
            sort_values(by='distance').head(1)['word'].values[0]

print(report2)
             soundex  metaphone     nysiis match_rating_codex
word                                                         
two              too        too        too                too
too              two         to         to                 to
to               two        too        too                too
fourth         forth      forth      forth              forth
forth         fourth     fourth     fourth             fourth
dessert       desert     desert     desert             desert
desert       dessert    dessert    dessert            dessert
Byrne          Boern      Boern      Boern              Boern
Boern          Byrne      Byrne      Byrne              Byrne
Smith          Smyth      Smyth      Smyth              Smyth
Smyth          Smith      Smith      Smith              Smith
Catherine    Kathryn    Kathryn    Kathryn            Kathryn
Kathryn    Catherine  Catherine  Catherine          Catherine

As you can see from the above, combining a phonetic encoding with a string comparison algorithm is very straightforward.
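If a bounded score is more convenient than a raw edit distance, one option is a ratio in [0, 1]. The sketch below swaps in the standard library's `difflib.SequenceMatcher`, which is a different measure from Levenshtein distance and is not part of the original answer. The metaphone codes are copied from the table in this answer rather than recomputed, and `code_similarity` and `closest_word` are hypothetical helper names.

```python
from difflib import SequenceMatcher

# Metaphone codes copied from the table in this answer (assumption: not recomputed).
metaphone_codes = {
    'two': 'TW', 'too': 'T', 'to': 'T',
    'fourth': 'FR0', 'forth': 'FR0',
    'Byrne': 'BRN', 'Boern': 'BRN',
    'Catherine': 'K0RN', 'Kathryn': 'K0RN',
}


def code_similarity(word_a, word_b, codes=metaphone_codes):
    """Similarity of two phonetic codes in [0, 1]; 1.0 means identical codes."""
    return SequenceMatcher(None, codes[word_a], codes[word_b]).ratio()


def closest_word(word, codes=metaphone_codes):
    """Return the other word whose code is most similar to that of `word`."""
    others = [w for w in codes if w != word]
    return max(others, key=lambda w: code_similarity(word, w, codes))


print(code_similarity('Catherine', 'Kathryn'))
print(closest_word('fourth'))
```

Because the score is normalized by code length, it can be compared across word pairs of different lengths, which a raw edit distance does not allow directly.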
