How to calculate the similarity measure of text documents?
Problem Description
I have a CSV file that looks like:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
I would like code that reads the file and computes the pairwise similarity between the messages.
I have looked into many posts regarding this, such as 1 2 3 4, but either they are hard for me to understand or not exactly what I want.
Based on some posts and webpages, a simple and effective option is "Cosine similarity", "Universal Sentence Encoder", or "Levenshtein distance".
It would be great if you could provide code that I can run on my side as well. Thanks.
Recommended Answer
I don't know that calculations like this can be vectorized particularly well, so looping is simple. At least use the fact that the calculation is symmetric and the diagonal is always 100 to cut the number of comparisons roughly in half.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

# Read the CSV shown above (adjust the path/separator to your file).
df = pd.read_csv('messages.csv')

K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:  # already filled via symmetry below
            continue
        if i == j:
            sim = 100
        else:
            # Use whatever metric you want here for comparing two strings.
            sim = fuzz.ratio(ac, bc)
        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)
Output: df_sim
idx    112    114    115    117
idx
112  100.0   78.0   51.0   50.0
114   78.0  100.0   47.0   54.0
115   51.0   47.0  100.0   83.0
117   50.0   54.0   83.0  100.0
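If fuzzywuzzy is not installed, a roughly comparable metric can be sketched with the standard library's difflib. Note this is an assumption-laden stand-in: fuzz.ratio is based on Levenshtein edit distance, while SequenceMatcher.ratio uses a matching-blocks measure, so the numbers will differ slightly from the table above:

```python
import difflib

def ratio(a: str, b: str) -> float:
    """Similarity score in [0, 100], comparable in spirit to fuzz.ratio."""
    return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()

# Close to, but not identical with, the 83.0 that fuzz.ratio gives above.
print(ratio("I don't have any car", "I don't have any bike"))
```

Swapping `fuzz.ratio(ac, bc)` for `ratio(ac, bc)` in the loop above keeps the rest of the code unchanged.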