Using word2vec to classify words in categories


QUESTION

BACKGROUND

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','bejing','washington','mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

APPROACH

I did some research and came across word2vec. This library has `similarity` and `most_similar` functions which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Compute its similarity with every word in each vector and take the average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the vector "Names", take the average, and then do the same for the other two vectors. The vector that gives me the highest average similarity would be the correct vector for the input to belong to.
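The averaging idea above can be sketched in plain Python. This is a minimal illustration with hand-made 3-d toy vectors standing in for real word2vec embeddings; in practice the vectors and the cosine similarity would come from a trained model (e.g. gensim's `similarity`), and the toy numbers here are invented for the example:

```python
import math

# Toy 3-d "embeddings"; a real model would supply 100+-dimensional vectors.
toy_vectors = {
    'john':   [0.9, 0.1, 0.0],
    'bob':    [0.8, 0.2, 0.1],
    'yellow': [0.1, 0.9, 0.2],
    'red':    [0.0, 0.8, 0.3],
    'tokyo':  [0.1, 0.2, 0.9],
    'mumbai': [0.2, 0.1, 0.8],
    'pink':   [0.05, 0.85, 0.25],  # query word
}

categories = {
    'Names': ['john', 'bob'],
    'Colors': ['yellow', 'red'],
    'Places': ['tokyo', 'mumbai'],
}

def cosine(a, b):
    """Cosine similarity, mirroring what word2vec's similarity() computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(word):
    """Pick the category with the highest average similarity to `word`."""
    averages = {
        name: sum(cosine(toy_vectors[word], toy_vectors[w]) for w in words) / len(words)
        for name, words in categories.items()
    }
    return max(averages, key=averages.get)

print(classify('pink'))  # -> 'Colors'
```

With real embeddings the structure is identical: only `toy_vectors` and `cosine` are replaced by model lookups.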

QUESTIONS

Given my limited knowledge of NLP and machine learning, I am not sure if that is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.

ANSWER

If you're looking for the simplest / fastest solution, then I'd suggest you take pre-trained word embeddings (word2vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation to your domain data.

Here's my solution:

import numpy as np

# Category -> words
data = {
  'Names': ['john','jay','dan','nathan','bob'],
  'Colors': ['yellow', 'red','green'],
  'Places': ['tokyo','bejing','washington','mumbai'],  # note: 'bejing' is a misspelling of 'beijing'
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories.keys()}

# Processing the query
def process(query):
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    dist = query_embed.dot(embed)  # raw dot product, not cosine similarity
    dist /= len(data[category])    # average over the category's word count
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))

In order to run it, you'll have to download and unpack the pre-trained GloVe 6B data (careful, 800 MB!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf score. Remember that the model size only depends on the data you have and the words you might want to be able to query.
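The filtering mentioned above can be done with a one-pass scan of the GloVe text file, since each line starts with the word it embeds. A minimal sketch, assuming you have already chosen the vocabulary to keep (e.g. the top-N words of your corpus ranked by tf-idf); the file names are illustrative:

```python
def filter_embeddings(lines, keep):
    """Keep only embedding lines whose leading token is in `keep`."""
    return [line for line in lines if line.split(' ', 1)[0] in keep]

# Usage against the GloVe text file (file names are examples):
# with open('glove.6B.100d.txt') as src, open('glove.small.txt', 'w') as dst:
#     dst.writelines(filter_embeddings(src, {'pink', 'frank', 'moscow'}))

# Small self-contained demonstration:
sample = [
    'pink 0.1 0.2 0.3\n',
    'aardvark 0.4 0.5 0.6\n',
    'moscow 0.7 0.8 0.9\n',
]
print(filter_embeddings(sample, {'pink', 'moscow'}))
# -> ['pink 0.1 0.2 0.3\n', 'moscow 0.7 0.8 0.9\n']
```

The reduced file loads the same way as the full one, just much faster and with a fraction of the memory.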

