在python中找到最相似的句子 [英] Finding most similar sentences among all in python

查看:66
本文介绍了在python中找到最相似的句子的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

建议/参考链接/代码表示赞赏.

Suggestions / refer links /codes are appreciated.

我有一个超过 1500 行的数据.每一行都有一个句子.我正在尝试找出在所有句子中找到最相似句子的最佳方法.

I have a data which is having more than 1500 rows. Each row has a sentence. I am trying to find out the best method to find the most similar sentences among all.

我的尝试

  1. 我尝试了 K-mean 算法,该算法将相似的句子分组在一个集群中.但是我发现了一个缺点,我必须通过 K 来创建一个集群.很难猜到K.我尝试了 elbo 方法来猜测集群,但将所有组合在一起是不够的.在这种方法中,我将所有数据分组.我正在寻找与 0.90% 以上的数据类似的数据,应返回 ID.

  1. I have tried K-mean algorithm which groups similar sentences in a cluster. But I found a drawback in which I have to pass K to create a cluster. It is hard to guess K. I tried elbo method to guess the clusters but grouping all together isn't sufficient. In this approach I am getting all the data grouped. I am looking for data which is similar above 0.90% data should be returned with ID.

我尝试了余弦相似度,其中我使用 TfidfVectorizer 创建矩阵,然后传入余弦相似度.即使这种方法也不能正常工作.

I tried cosine similarity in which I used TfidfVectorizer to create matrix and then passed in cosine similarity. Even this approach didn't worked properly.

我在找什么

我想要一种方法,我可以在其中传递一个阈值示例 0.90 数据应作为结果返回.

I want an approach where I can pass a threshold example 0.90 data in all rows which are similar to each other above 0.90% should be returned as a result.

Data Sample
ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN   
11    | MAXPREDO Validation is corect
12    | Move to QC  
13    | Cancel ASN WMS Cancel ASN   
14    | MAXPREDO Validation is right
15    | Verify files are sent every hours for this interface from Optima
16    | MAXPREDO Validation are correct
17    | Move to QC  
18    | Verify files are not sent

预期结果

使用 ID

ID    |   DESCRIPTION
-----------------------------
10    | Cancel ASN WMS Cancel ASN
13    | Cancel ASN WMS Cancel ASN
11    | MAXPREDO Validation is corect  # even spelling is not correct
14    | MAXPREDO Validation is right
16    | MAXPREDO Validation are correct
12    | Move to QC  
17    | Move to QC  

推荐答案

为什么余弦相似度和 TFIDF 向量化器对您不起作用?

Why did it not work for you with cosine similarity and the TFIDF-vectorizer?

我试过了,它适用于以下代码:

I tried it and it works with this code:

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(columns=["ID","DESCRIPTION"], data=np.matrix([[10,"Cancel ASN WMS Cancel ASN"],
                                                                [11,"MAXPREDO Validation is corect"],
                                                                [12,"Move to QC"],
                                                                [13,"Cancel ASN WMS Cancel ASN"],
                                                                [14,"MAXPREDO Validation is right"],
                                                                [15,"Verify files are sent every hours for this interface from Optima"],
                                                                [16,"MAXPREDO Validation are correct"],
                                                                [17,"Move to QC"],
                                                                [18,"Verify files are not sent"]
                                                                ]))

corpus = list(df["DESCRIPTION"].values)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

threshold = 0.4

for x in range(0,X.shape[0]):
  for y in range(x,X.shape[0]):
    if(x!=y):
      if(cosine_similarity(X[x],X[y])>threshold):
        print(df["ID"][x],":",corpus[x])
        print(df["ID"][y],":",corpus[y])
        print("Cosine similarity:",cosine_similarity(X[x],X[y]))
        print()

阈值也可以调整,但阈值为 0.9 不会产生您想要的结果.

The threshold can be adjusted as well, but will not yield the results you want with a threshold of 0.9.

阈值为 0.4 的输出为:

The output for a threshold of 0.4 is:

10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]

阈值为 0.39 时,您所有预期的句子都是输出中的特征,但也可以找到带有索引 [15,18] 的附加对:

With a threshold of 0.39 all your expected sentences are features in the output, but an additional pair with the indices [15,18] can be found as well:

10 : Cancel ASN WMS Cancel ASN
13 : Cancel ASN WMS Cancel ASN
Cosine similarity: [[1.]]

11 : MAXPREDO Validation is corect
14 : MAXPREDO Validation is right
Cosine similarity: [[0.64183024]]

11 : MAXPREDO Validation is corect
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

12 : Move to QC
17 : Move to QC
Cosine similarity: [[1.]]

14 : MAXPREDO Validation is right
16 : MAXPREDO Validation are correct
Cosine similarity: [[0.39895808]]

15 : Verify files are sent every hours for this interface from Optima
18 : Verify files are not sent
Cosine similarity: [[0.44897995]]

这篇关于在python中找到最相似的句子的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆