Text similarity using Word2Vec


Problem description


I would like to use Word2Vec to check similarity of texts.

I am currently using another logic:

from fuzzywuzzy import fuzz

def sim(name, dataset):
    # ">=" restored here; the comparison operator was lost when the page was extracted
    matches = dataset.apply(lambda row: (fuzz.ratio(row['Text'], name) >= 0.5), axis=1)
    return matches

(name is my column).

For applying this function I do the following:

df['Sim'] = df.apply(lambda row: sim(row['Text'], df), axis=1)

Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset?

Example of dataset:

Text
Hello, this is Peter, what would you need me to help you with today? 
I need you
Good Morning, John here, are you calling regarding your cell phone bill? 
Hi, this is John. What can I do for you?
...

The first text and the last one are quite similar, although they use different words to express a similar concept. I would like to create a new column where, for each row, I put the texts that are similar. I hope you can help me.

Solution

TL;DR: skip to the last section (part 4) for the code implementation

1. Fuzzy vs Word embeddings

Unlike fuzzy matching, which is basically edit distance (Levenshtein distance) matching strings at the character level, word2vec (and other models such as fastText and GloVe) represents each word in an n-dimensional Euclidean space. The vector that represents each word is called a word vector or word embedding.

These word embeddings are n-dimensional vector representations of a large vocabulary of words. These vectors can be summed up to create a representation of a sentence's embedding. Sentences whose words have similar semantics will have similar vectors, and thus their sentence embeddings will also be similar. Read more about how word2vec works internally here.

Let's say I have a sentence with 2 words. Word2Vec represents each word here as a vector in some Euclidean space. Summing them up, just like standard vector addition, results in another vector in the same space. This can be a good choice for representing a sentence using individual word embeddings.

NOTE: There are other methods of combining word embeddings, such as a weighted sum with tf-idf weights, or directly using sentence embeddings with an algorithm called Doc2Vec. Read more about this here.
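For illustration, here is a minimal sketch of the weighted-sum idea, assuming the same pre-trained gensim model that part 4 below loads and using scikit-learn's TfidfVectorizer for the weights (weighting by idf only, as a simplification; the corpus here is a made-up example):

import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

model = api.load("glove-wiki-gigaword-50")  # same model as in part 4

corpus = ['mark zuckerberg owns the facebook company',
          'facebook company ceo is mark zuckerberg']

# Fit tf-idf on the corpus and keep each word's idf weight
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

def weighted_sentence_vector(sentence):
    # Weight each word vector by its idf so informative words count more;
    # words missing from the model or the tf-idf vocabulary are skipped
    words = [w for w in sentence.lower().split() if w in model and w in idf]
    return np.sum([model[w] * idf[w] for w in words], axis=0)

print(weighted_sentence_vector(corpus[0]).shape)  # (50,)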

2. Similarity between word vectors / sentence vectors

"You shall know a word by the company it keeps"

Words that occur together (i.e., share context) are usually similar in semantics/meaning. The great thing about word2vec is that word vectors for words with similar contexts lie closer to each other in the Euclidean space. This lets you do things like clustering or just simple distance calculations.
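For instance, the nearest neighbours in the embedding space tend to be semantically related words. A small illustration, assuming the same pre-trained model that part 4 loads:

import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# Nearest neighbours by cosine similarity in the embedding space
print(model.most_similar('phone', topn=3))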

A good way to find how similar two word vectors are is cosine similarity. Read more here.
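Concretely, cosine similarity is the dot product of the two vectors divided by the product of their norms; a minimal numpy version for reference:

import numpy as np

def cosine_sim(a, b):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))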

3. Pre-trained word2vec models (and others)

The awesome thing about word2vec and such models is that in most cases you don't need to train them on your data. You can use pre-trained word embeddings that were trained on a ton of data and encode the contextual/semantic similarities between words based on their co-occurrence with other words in sentences.

You can check the similarity between these sentence embeddings using cosine_similarity.
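For example, with scikit-learn's cosine_similarity, which expects 2-D arrays (hence the reshape). The embeddings below are random stand-ins for sentence vectors built as in part 4:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

emb_a = np.random.rand(50)  # placeholder sentence embedding
emb_b = np.random.rand(50)  # placeholder sentence embedding

score = cosine_similarity(emb_a.reshape(1, -1), emb_b.reshape(1, -1))[0][0]
print(score)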

4. Sample code implementation

I use a GloVe model (similar to word2vec) that was pre-trained on Wikipedia, where each word is represented as a 50-dimensional vector. You can choose models other than the one I used from here - https://github.com/RaRe-Technologies/gensim-data

from scipy import spatial
import gensim.downloader as api
import numpy as np

# Load pre-trained 50-dimensional GloVe vectors
model = api.load("glove-wiki-gigaword-50") #choose from multiple models https://github.com/RaRe-Technologies/gensim-data

s0 = 'Mark zuckerberg owns the facebook company'
s1 = 'Facebook company ceo is mark zuckerberg'
s2 = 'Microsoft is owned by Bill gates'
s3 = 'How to learn japanese'

def preprocess(s):
    # Lowercase and split on whitespace
    return [i.lower() for i in s.split()]

def get_vector(s):
    # Sum the word vectors to form a single sentence vector
    return np.sum(np.array([model[i] for i in preprocess(s)]), axis=0)

print('s0 vs s1 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s1)))
print('s0 vs s2 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s2)))
print('s0 vs s3 ->', 1 - spatial.distance.cosine(get_vector(s0), get_vector(s3)))

#Semantic similarity between sentence pairs
s0 vs s1 -> 0.965923011302948
s0 vs s2 -> 0.8659112453460693
s0 vs s3 -> 0.5877998471260071
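To connect this back to the question, here is a hedged sketch of how sim could swap fuzz.ratio for embedding-based cosine similarity. The 'Text' column and the 0.5 threshold are carried over from the question; skipping out-of-vocabulary words is an assumption added here, not part of the answer above:

import numpy as np
import pandas as pd
from scipy import spatial
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

def get_vector(s):
    # Sum the vectors of in-vocabulary words to form a sentence embedding
    words = [w for w in s.lower().split() if w in model]
    return np.sum(np.array([model[w] for w in words]), axis=0)

def sim(name, dataset, threshold=0.5):
    vec = get_vector(name)
    # Cosine similarity between `name` and every text in the dataset
    scores = dataset['Text'].apply(
        lambda t: 1 - spatial.distance.cosine(vec, get_vector(t)))
    # Return the texts whose similarity clears the threshold
    return dataset.loc[scores >= threshold, 'Text'].tolist()

df = pd.DataFrame({'Text': [
    'Hello, this is Peter, what would you need me to help you with today?',
    'Hi, this is John. What can I do for you?']})
df['Sim'] = df.apply(lambda row: sim(row['Text'], df), axis=1)
print(df['Sim'])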
