How to get similar-sounding words together

Question

I am trying to get all the similar-sounding words from a list.

I tried to get them using cosine similarity, but that does not serve my purpose.

from sklearn.metrics.pairwise import cosine_similarity
dataList = ['two','fourth','forth','dessert','to','desert']
cosine_similarity(dataList)

I know this is not the right approach, and I cannot seem to get a result like:

result = ['xx', 'xx', 'yy', 'yy', 'zz', 'zz'] 

where matching labels mark words that sound similar.

Recommended answer

First, you need a method suited to finding similar-sounding words, i.e. a phonetic string-matching algorithm rather than cosine similarity. I would suggest:

Using jellyfish:

from jellyfish import soundex

print(soundex("two"))
print(soundex("to"))

Output:

T000
T000

Now, perhaps create a function that handles the list, then sort its result so that matching codes end up next to each other:

def getSoundexList(dList):
    res = [soundex(x) for x in dList]   # iterate over each elem in the dataList
    # print(res)     # ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']
    return res

dataList = ['two','fourth','forth','dessert','to','desert']
print(sorted(getSoundexList(dataList)))

Output:

['D263', 'D263', 'F630', 'F630', 'T000', 'T000']

Edit:

Another way could be:

Using fuzzy:

import fuzzy
soundex = fuzzy.Soundex(4)

print(soundex("to"))
print(soundex("two"))

Output:

T000
T000

Edit 2:

If you want them grouped, you could use groupby:

from itertools import groupby

def getSoundexList(dList):
    return sorted([soundex(x) for x in dList])

dataList = ['two','fourth','forth','dessert','to','desert']    
print([list(g) for _, g in groupby(getSoundexList(dataList), lambda x: x)])

Output:

[['D263', 'D263'], ['F630', 'F630'], ['T000', 'T000']]
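
One detail worth spelling out: groupby only merges adjacent equal items, which is why getSoundexList sorts before grouping. A small sketch using the codes from the commented print earlier in this answer (the variable names here are only for illustration):

```python
from itertools import groupby

# Soundex codes in the original list order, as shown in the earlier comment
codes = ['T000', 'F630', 'F630', 'D263', 'T000', 'D263']

# Without sorting, equal codes that are not adjacent land in separate groups
unsorted_groups = [list(g) for _, g in groupby(codes)]

# Sorting first puts equal codes side by side, so each code forms one group
sorted_groups = [list(g) for _, g in groupby(sorted(codes))]

print(unsorted_groups)
print(sorted_groups)
```

The first list splits T000 and D263 into two groups each, while the second matches the grouped output shown above.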

Edit 3:

This one is for @Eric Duminil. Let's say you want both the names and their respective values:

Using dict and itemgetter:

from itertools import groupby
from operator import itemgetter

def getSoundexDict(dList):
    # map each word to its Soundex code, then sort the (word, code) pairs by code
    dict_ = {x: soundex(x) for x in dList}
    return sorted(dict_.items(), key=itemgetter(1))

dataList = ['two','fourth','forth','dessert','to','desert']

# group the sorted pairs by their code (second element of each pair)
print([list(g) for _, g in groupby(getSoundexDict(dataList), lambda x: x[1])])

Output:

[[('dessert', 'D263'), ('desert', 'D263')], [('fourth', 'F630'), ('forth', 'F630')], [('two', 'T000'), ('to', 'T000')]]
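
If sorting is not needed, the same grouping can probably be done in one pass with a dictionary of lists. The sketch below is only an illustration: group_by_code is a hypothetical helper, and the lookup table stands in for a real encoder so the snippet runs without any third-party package.

```python
from collections import defaultdict

def group_by_code(words, encode):
    """Collect words under the code returned by encode, keeping input order."""
    groups = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return dict(groups)

# Codes copied from the output above; in practice, pass a real encoder
# (for example the soundex function used earlier) instead of this table.
known_codes = {'two': 'T000', 'fourth': 'F630', 'forth': 'F630',
               'dessert': 'D263', 'to': 'T000', 'desert': 'D263'}

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
groups = group_by_code(dataList, known_codes.get)
print(groups)
```

Here the groups appear in the order their codes first occur in the list, rather than in sorted order.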

Edit 4 (for the OP):

Soundex:

Soundex is a system whereby values are assigned to names in such a manner that similar-sounding names get the same value. These values are known as soundex encodings. A search application based on soundex will not search for a name directly but rather will search for the soundex encoding. By doing so, it will obtain all names that sound like the name being sought.
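
To make the encoding less of a black box, here is a minimal pure-Python sketch of the commonly described American Soundex rules. The name simple_soundex is hypothetical, and this is not the implementation used by jellyfish or fuzzy, which may differ on edge cases.

```python
# Digit assigned to each consonant; vowels, H, W and Y have no digit
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def simple_soundex(word, length=4):
    """Return a Soundex-style code: first letter plus up to three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Append a digit only when it differs from the previous letter's digit
        if code and code != prev:
            encoded += code
        # H and W do not separate letters with the same digit; vowels do
        if ch not in "HW":
            prev = code
    # Pad with zeros or truncate to the requested length
    return (encoded + "0" * length)[:length]

dataList = ['two', 'fourth', 'forth', 'dessert', 'to', 'desert']
print([simple_soundex(w) for w in dataList])
```

For the six words in the question, this sketch yields the same codes as the outputs shown earlier in the answer.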

Read more.
