python - calculate orthographic similarity between words of a list

Question

I need to calculate orthographic similarity (edit/Levenshtein distance) among words in a given corpus.

As Kirill suggested below, I tried to do the following:

import csv, itertools, Levenshtein
import numpy as np

# import the list of words from csv file
path = '/Users/my path'
file = path + 'file.csv'

with open(file, 'rb') as f:
    reader = csv.reader(f)
    wordlist = list(reader)

wordlist = np.array(wordlist) #make it a np array
wordlist2 = wordlist[:,0] #subset the first column of the imported list

for a, b in itertools.product(wordlist, wordlist):
    if a < b:
        print(a, b, Levenshtein.distance(a, b))

However, I get the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I understand the ambiguity in the code, but can someone help me figure out how to solve this? Thanks!

Recommended answer

Here's the code I came up with, thanks to Kirill's help.

import csv
import itertools
import os

import Levenshtein

# open the newline-separated list of words
path = '/Users/your path'
file = os.path.join(path, 'wordlists.txt')
output = os.path.join(path, 'ortho_similarities.txt')

# read the words, stripping whitespace and dropping duplicates
with open(file) as f:
    words = sorted(set(s.strip() for s in f))

# the following loop takes all possible pairwise combinations
# of the words in the list, calculates the Levenshtein distance (LD)
# for each pair, and writes everything to a csv file
# (on Python 3 the csv module expects text mode with newline='')
with open(output, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\n")
    for a, b in itertools.product(words, words):
        if a < b:  # keep each unordered pair once and skip self-pairs
            writer.writerow([a, b, Levenshtein.distance(a, b)])
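
A note on why the original attempt failed: the ValueError most likely arises because the loop iterates over the two-dimensional array wordlist rather than the column wordlist2. Each a and b is then a whole row, so a < b produces an array of booleans instead of a single True/False. Working with a plain list of strings, as the answer does, avoids this. The sketch below shows the same pairwise idea with itertools.combinations, which yields each unordered pair exactly once without the a < b filter. The helper names (edit_distance, pairwise_distances, write_distances) and the sample words are hypothetical, and edit_distance is a pure-Python stand-in for Levenshtein.distance so that the example runs without third-party packages.

```python
import csv
import io
import itertools


def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programming table.

    Stand-in for Levenshtein.distance (an assumption made so the sketch
    has no third-party dependency).
    """
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(words, distance=edit_distance):
    """Yield (a, b, distance) once per unordered pair of distinct words."""
    unique_words = sorted(set(words))
    for a, b in itertools.combinations(unique_words, 2):
        yield a, b, distance(a, b)


def write_distances(words, handle):
    """Write one csv row per word pair to an open text handle."""
    writer = csv.writer(handle, delimiter=",", lineterminator="\n")
    writer.writerows(pairwise_distances(words))


# hypothetical sample data, written to an in-memory buffer instead of a file
sample = ["kitten", "sitting", "mitten"]
buffer = io.StringIO()
write_distances(sample, buffer)
print(buffer.getvalue())
```

If the Levenshtein package is installed, passing distance=Levenshtein.distance to pairwise_distances should give the same numbers and will usually be much faster on a large word list.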
