Classify words with the same meaning


Question

I have 50,000 subject lines from emails, and I want to classify the words in them based on synonyms or words that can be used in place of others.

For example:

Top sales!

Best sales

I want them to be in the same group.

I built the following function with NLTK's WordNet, but it doesn't work well.

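from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

def synonyms(w, group, guide):
    """Return True if the first senses of `w` and `group` are similar.

    `guide` is the WordNet part-of-speech tag, e.g. 'n' or 'v'.
    """
    try:
        # Only the first sense ('.01') of each word is compared
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
    except WordNetError:
        # One of the words has no such synset
        return False

    # wup_similarity can return None when the synsets share no ancestor
    similarity = w1.wup_similarity(w2)
    return similarity is not None and similarity >= 0.7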

Any ideas or tools to accomplish this?

Recommended answer

The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).

Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
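As a rough illustration of comparing words in a vector space, the sketch below computes cosine similarity between a few toy vectors. The vectors and the `toy_vectors` name are invented for this example; real Word2Vec embeddings are learned from a corpus and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_vectors = {
    "top":   [0.9, 0.8, 0.1],
    "best":  [0.8, 0.9, 0.2],
    "river": [0.1, 0.2, 0.9],
}

sim_top_best = cosine_similarity(toy_vectors["top"], toy_vectors["best"])
sim_top_river = cosine_similarity(toy_vectors["top"], toy_vectors["river"])

# Words with related meanings should score higher than unrelated ones.
print(sim_top_best > sim_top_river)  # True
```

With trained embeddings, the same comparison would be made on the vectors looked up from the model rather than on hand-written numbers.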

One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:

  • The river bank was dry.
  • The bank loaned money to me.
  • The plane may bank to the left.

The word bank has the same vector in each of these cases, but you may want these senses to be sorted into different groups.

One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
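The difference can be sketched with two toy lookup tables: one keyed by the word alone, and one keyed by word plus part of speech. All vectors and names here (`word_vectors`, `sense_vectors`, `lookup_sense`) are hypothetical, and the `word|POS` key format is only an assumption modeled on how sense-tagged vocabularies are commonly written.

```python
# Hypothetical vectors, made up for illustration only.

# A word-level table stores a single vector per spelling,
# so every use of "bank" maps to the same entry.
word_vectors = {
    "bank": [0.5, 0.5, 0.5],
}

# A sense-aware table keys each entry by word and part of speech,
# so the noun and the verb get separate vectors.
sense_vectors = {
    "bank|NOUN": [0.9, 0.1, 0.2],
    "bank|VERB": [0.1, 0.2, 0.9],
}

def lookup_sense(word, pos):
    """Return the vector for a (word, part-of-speech) pair, or None if unknown."""
    return sense_vectors.get(f"{word}|{pos}")

noun_vec = lookup_sense("bank", "NOUN")
verb_vec = lookup_sense("bank", "VERB")

print(noun_vec != verb_vec)  # True
```

Part of speech alone separates "to bank" from "the bank", but not the river bank from the financial bank, since both are nouns; telling those apart needs additional context features.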

A great Python library for this is spaCy. It is like NLTK but much faster, as it is written in Cython (20x faster for tokenization and 400x faster for tagging). Its medium and large English models ship with word vectors, and the same authors publish sense2vec as a companion package, so you can accomplish your similarity task with very little extra tooling.

Usage is simple:

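import spacy

# Requires a model that includes word vectors, e.g.:
#   python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

# Tokenize the phrase, then compare the first and last tokens
apples, and_, oranges = nlp('apples and oranges')
print(apples.similarity(oranges))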

It's free and has a liberal license!
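To tie this back to the question, here is a minimal sketch of grouping subject lines once a similarity function is available. The greedy strategy, the `group_by_similarity` and `token_overlap` names, and the thresholds are all assumptions for illustration; `token_overlap` is a simple stand-in so the example runs without any model.

```python
def group_by_similarity(items, similarity, threshold=0.7):
    """Greedily put each item into the first group whose representative is similar enough."""
    groups = []  # each group is a list; its first element acts as the representative
    for item in items:
        for group in groups:
            if similarity(item, group[0]) >= threshold:
                group.append(item)
                break
        else:
            # No existing group was close enough, so start a new one
            groups.append([item])
    return groups

def token_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of lowercased, whitespace-split words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

subjects = ["Top sales", "Best sales", "Meeting tomorrow"]
groups = group_by_similarity(subjects, token_overlap, threshold=0.3)
print(groups)  # [['Top sales', 'Best sales'], ['Meeting tomorrow']]
```

With spaCy loaded, an embedding-based function such as `lambda a, b: nlp(a).similarity(nlp(b))` could be passed in place of `token_overlap`, so that lines are grouped by meaning rather than by shared words. The result of a greedy pass depends on input order, so a proper clustering algorithm may be worth trying on 50,000 lines.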
