How to calculate the similarity measure of text documents?
Problem Description
I have a CSV file that looks like:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
I would like code that reads the file and computes the pairwise similarity between the messages.
I have looked into many posts regarding this, such as 1 2 3 4, but either they are hard for me to understand or not exactly what I want.
Based on some posts and webpages, a simple and effective option is "Cosine similarity", "Universal Sentence Encoder", or "Levenshtein distance".
It would be great if you could provide code that I can run on my side as well. Thanks.
Recommended Answer
I don't know that calculations like this can be vectorized particularly well, so looping is simple. At least use the fact that the calculation is symmetric and the diagonal is always 100 to cut the number of comparisons roughly in half.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

# Read the CSV shown above (adjust the path/separator to your file).
df = pd.read_csv('messages.csv')

K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:  # already filled via symmetry below
            continue
        if i == j:
            sim = 100
        else:
            # Use whatever metric you want here for comparing two strings.
            sim = fuzz.ratio(ac, bc)
        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)
Output: df_sim
idx    112    114    115    117
idx
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
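If fuzzywuzzy is not installed, a roughly comparable metric can be sketched with the standard library's difflib. Note this is an assumption-laden stand-in: fuzz.ratio is based on Levenshtein edit distance, while SequenceMatcher.ratio uses a matching-blocks measure, so the numbers will differ slightly from the table above:

```python
import difflib

def ratio(a: str, b: str) -> float:
    """Similarity score in [0, 100], comparable in spirit to fuzz.ratio."""
    return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()

# Close to, but not identical with, the 83.0 that fuzz.ratio gives above.
print(ratio("I don't have any car", "I don't have any bike"))
```

Swapping `fuzz.ratio(ac, bc)` for `ratio(ac, bc)` in the loop above keeps the rest of the code unchanged.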