python - calculate orthographic similarity between words of a list
Question
I need to calculate the orthographic similarity (edit/Levenshtein distance) between the words in a given corpus.
As Kirill suggested below, I tried the following:
import csv, itertools, Levenshtein
import numpy as np
# import the list of words from csv file
path = '/Users/my path'
file = path + 'file.csv'
with open(file, 'rb') as f:
    reader = csv.reader(f)
    wordlist = list(reader)
wordlist = np.array(wordlist) #make it a np array
wordlist2 = wordlist[:,0] #subset the first column of the imported list
for a, b in itertools.product(wordlist, wordlist):
    if a < b:
        print(a, b, Levenshtein.distance(a, b))
However, I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I understand that the comparison in the code is ambiguous, but can someone help me figure out how to fix it? Thanks!
Recommended answer
Here's the code I came up with, thanks to Kirill's help.
import csv
import itertools
import os

import Levenshtein  # third-party package

# open the newline-separated list of words
path = '/Users/your path'
file = os.path.join(path, 'wordlists.txt')
output = os.path.join(path, 'ortho_similarities.txt')

# read the words, drop duplicates, and sort them
with open(file) as f:
    words = sorted(set(s.strip() for s in f))

# take every unordered pair of distinct words, calculate the
# Levenshtein distance (LD), and write each result as a row of a csv file
# (Python 3 text mode; on Python 2, open the output with 'wb' instead)
with open(output, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    # words is sorted and unique, so combinations() yields exactly the pairs with a < b
    for a, b in itertools.combinations(words, 2):
        writer.writerow([a, b, Levenshtein.distance(a, b)])
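For anyone who wants to keep the CSV input from the question: the ValueError most likely comes from looping over the two-dimensional array `wordlist` rather than the one-dimensional `wordlist2`, so `a` and `b` are whole rows and `a < b` produces an array of booleans instead of a single truth value. The sketch below is one possible way around that, under a few assumptions: the words sit in the first column of the CSV, the helper names `first_column_words`, `levenshtein`, and `pairwise_distances` are hypothetical, the sample data is made up, and `levenshtein` is a plain-Python stand-in for `Levenshtein.distance` so that the example runs without third-party packages.

```python
import csv
import io
import itertools


def first_column_words(csv_file):
    """Return the sorted, de-duplicated words from the first CSV column."""
    reader = csv.reader(csv_file)
    # skip empty rows and rows whose first cell is blank
    return sorted({row[0].strip() for row in reader if row and row[0].strip()})


def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence.

    Stand-in for Levenshtein.distance (assumption: plain strings as input).
    """
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(words):
    """Yield (a, b, distance) for every unordered pair of distinct words."""
    for a, b in itertools.combinations(words, 2):
        yield a, b, levenshtein(a, b)


# made-up sample data standing in for the CSV file from the question
sample = io.StringIO("kitten,1\nsitting,2\nmitten,3\n")
words = first_column_words(sample)
results = list(pairwise_distances(words))
for row in results:
    print(*row)
```

Because every `a` and `b` here is an ordinary string, no array comparison takes place. With the `Levenshtein` package installed, `Levenshtein.distance` could presumably be swapped in for the stand-in function.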