Equate strings based on meaning


Problem description


Is there a way to equate strings in Python based on their meaning, despite them not being similar? For example,

  1. temp. Max
  2. maximum ambient temperature

I've tried using fuzzywuzzy and difflib, and although they are generally good at this via token matching, they also produce false positives when I threshold the outputs over a large number of strings. Is there some other method using NLP or tokenization that I'm missing here?

Edit: The answer provided by A CO does solve the problem mentioned above, but is there any way to match specific substrings against a key using word2vec? E.g. key = "max temp", sentence = "the maximum ambient temperature expected tomorrow in California is 34 degrees".

So here I'd like to get the substring "maximum ambient temperature". Any tips on that?

Solution

As you say, packages like fuzzywuzzy or difflib will be limited because they compute similarities based on the spelling of the strings, not on their meaning.

You could use word embeddings. Word embeddings are vector representations of words, computed in a way that allows them to represent their meaning, to a certain extent.

There are different methods for generating word embeddings, but the most common one is to train a neural network on one (or a set) of word-level NLP tasks and use the penultimate layer as a representation of the word. This way, the final representation of the word is supposed to have accumulated enough information to complete the task, and this information can be interpreted as an approximation of the meaning of the word. I recommend that you read a bit about Word2vec, which is the method that made word embeddings popular: it is simple to understand but representative of what word embeddings are. Here is a good introductory article. The similarity between two words can then be computed, usually as the cosine distance between their vector representations.
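
As a concrete illustration, the cosine similarity (one minus the cosine distance) between two word vectors can be computed directly from their dot product and norms. Here is a minimal sketch, assuming spacy's en_core_web_lg vectors (the same model used in the example further below):

# Minimal sketch: cosine similarity between two word vectors.
# Assumes spacy and the en_core_web_lg model are installed.
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between the vectors: values close to 1.0
    # indicate that the words appear in similar contexts.
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

vec_maximum = nlp('maximum')[0].vector   # vector of the single token "maximum"
vec_highest = nlp('highest')[0].vector
print(cosine_similarity(vec_maximum, vec_highest))  # related words, so a fairly high score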

Of course, you don't need to train word embeddings yourself, as plenty of pretrained vectors are available (glove, word2vec, fasttext, spacy...). The choice of which embeddings to use depends on the observed performance and on your understanding of how well they fit the task you want to perform. Here is an example with spacy's word vectors, where the sentence vector is computed by averaging the word vectors:

# Importing spacy and fuzzywuzzy
import spacy
from fuzzywuzzy import fuzz

# Loading spacy's large English model
nlp_model = spacy.load('en_core_web_lg')

s1 = "temp. Max"
s2 = "maximum ambient temperature"
s3 = "the blue cat"

doc1 = nlp_model(s1)
doc2 = nlp_model(s2)
doc3 = nlp_model(s3)

# Word vectors (the document or sentence vector is the average of the word vectors it contains)
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s2, doc1.similarity(doc2)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, doc1.similarity(doc3)))
print("Document vectors similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, doc2.similarity(doc3)))

# Fuzzy matching (character-based, so it operates on the raw strings)
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s2, fuzz.ratio(s1, s2)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s1, s3, fuzz.ratio(s1, s3)))
print("Character ratio similarity between '{}' and '{}' is: {:.4f}".format(s2, s3, fuzz.ratio(s2, s3)))

This will print:

>>> Document vectors similarity between 'temp. Max' and 'maximum ambient temperature' is: 0.6432 
>>> Document vectors similarity between 'temp. Max' and 'the blue cat' is: 0.3810
>>> Document vectors similarity between 'maximum ambient temperature' and 'the blue cat' is: 0.3117

>>> Character ratio similarity between 'temp. Max' and 'maximum ambient temperature' is: 28.0000 
>>> Character ratio similarity between 'temp. Max' and 'the blue cat' is: 38.0000
>>> Character ratio similarity between 'maximum ambient temperature' and 'the blue cat' is: 21.0000

As you can see, the word-vector similarity reflects the similarity in meaning of the documents much better.

However, this is really just a starting point, as there can be plenty of caveats. Here is a list of some of the things you should watch out for:

  • Word (and document) vectors do not represent the meaning of the word (or document) per se; they are a way to approximate it. That implies that they hit limitations at some point, and you cannot take for granted that they will let you differentiate every nuance of the language.
  • What we expect the "similarity in meaning" between two words or sentences to be varies according to the task at hand. As an example, what would be the "ideal" similarity between "maximum temperature" and "minimum temperature"? High, because they refer to extreme states of the same concept, or low, because they refer to opposite states of the same concept? With word embeddings you will usually get a high similarity for these sentences, because "maximum" and "minimum" often appear in the same contexts and therefore have similar vectors (there is a short check of this after the example below).
  • In the example given, 0.6432 is still not a very high similarity. This probably comes from the use of abbreviated words in the example. Depending on how the word embeddings were generated, they might not handle abbreviations well. In general, it is better to give NLP algorithms syntactically and grammatically correct input. Depending on what your dataset looks like and your knowledge of it, doing some cleaning beforehand can be very helpful. Here is an example with grammatically correct sentences that highlights the similarity in meaning better:


s1 = "The president has given a good speech"
s2 = "Our representative has made a nice presentation"
s3 = "The president ate macaronis with cheese"

doc1 = nlp_model (s1)
doc2 = nlp_model (s2)
doc3 = nlp_model (s3)

# Word vectors
print(doc1.similarity(doc2))
>>> 0.8779 
print(doc1.similarity(doc3))
>>> 0.6131
print(doc2.similarity(doc3))
>>> 0.5771
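
To illustrate the caveat about antonyms mentioned above, here is a short check (a sketch reusing the nlp_model loaded earlier; the exact value depends on the model version, but it is typically high):

# Short check of the antonym caveat, reusing nlp_model (en_core_web_lg) from above.
# "maximum" and "minimum" appear in similar contexts, so their vectors are close.
doc_max = nlp_model("maximum temperature")
doc_min = nlp_model("minimum temperature")
print(doc_max.similarity(doc_min))  # typically a high score, despite the opposite meanings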

Anyway, word embeddings are probably what you are looking for, but you need to take the time to learn about them. I would recommend that you read about word (and sentence, and document) embeddings and play around a bit with different pretrained vectors to get a better understanding of how they can be used for your task.
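
Regarding the follow-up question about extracting the matching substring itself, one possible direction (a rough sketch under the same spacy setup, not a definitive solution) is to score every short span of the sentence against the key and keep the best one:

# Rough sketch: find the span of a sentence most similar to a key phrase.
# Reuses nlp_model (en_core_web_lg) from above; the window size is illustrative
# and would need tuning on real data.
def best_matching_span(key, sentence, max_len=5):
    key_doc = nlp_model(key)
    sent_doc = nlp_model(sentence)
    best_span, best_score = None, -1.0
    for length in range(1, max_len + 1):
        for start in range(len(sent_doc) - length + 1):
            span = sent_doc[start:start + length]
            if not span.vector_norm:  # skip spans without vectors (e.g. punctuation only)
                continue
            score = key_doc.similarity(span)
            if score > best_score:
                best_span, best_score = span, score
    return best_span, best_score

span, score = best_matching_span(
    "max temp",
    "the maximum ambient temperature expected tomorrow in California is 34 degrees")
print(span.text, score)  # ideally a span like "maximum ambient temperature"

This brute-force scan is quadratic in sentence length, so for long documents you would want smarter candidate spans (for example noun chunks via sent_doc.noun_chunks), but it shows the idea.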
