Using word2vec to classify words in categories


Problem description

Background

I have vectors with some sample data, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','beijing','washington','mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple", then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places' as the correct category.

Approach

I did some research and came across word2vec. The library has `similarity` and `most_similar` functions which I can use. So one brute-force approach I thought of is the following:

  1. Take the new input.
  2. Compute its similarity with each word in each vector and take the average.

So, for instance, for the input "pink" I can calculate its similarity with the words in the "Names" vector, take the average, and then do the same for the other two vectors. The vector that gives me the highest average similarity would be the correct vector for the input to belong to.
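
The averaging scheme described above can be sketched with a few toy vectors. The embedding values below are made up purely for illustration; a real word2vec or GloVe model would supply them (typically in 100+ dimensions):

```python
import numpy as np

# Made-up 3-d embeddings standing in for real word vectors
embeddings = {
    'john':   np.array([0.9, 0.1, 0.0]),
    'jay':    np.array([0.8, 0.2, 0.1]),
    'yellow': np.array([0.1, 0.9, 0.0]),
    'red':    np.array([0.2, 0.8, 0.1]),
    'tokyo':  np.array([0.0, 0.1, 0.9]),
    'mumbai': np.array([0.1, 0.0, 0.8]),
}
categories = {
    'Names':  ['john', 'jay'],
    'Colors': ['yellow', 'red'],
    'Places': ['tokyo', 'mumbai'],
}

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(word_vec):
    # Average the similarity against each category's words, pick the best
    avg = {cat: np.mean([cosine(word_vec, embeddings[w]) for w in words])
           for cat, words in categories.items()}
    return max(avg, key=avg.get)

# A made-up query vector that happens to lie near the color words:
print(predict(np.array([0.15, 0.85, 0.05])))  # -> 'Colors'
```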

Question

Given my limited knowledge of NLP and machine learning, I am not sure if that is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions, and please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.

Answer

If you're looking for the simplest/fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

Here is my solution:

import numpy as np

# Category -> words
data = {
  'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
  'Colors': ['yellow', 'red', 'green'],
  'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
}
# Words -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
  for line in f:
    values = line.split()
    word = values[0]
    embed = np.array(values[1:], dtype=np.float32)
    embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for available words
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}

# Processing the query
def process(query):
  # Assumes the query word exists in the GloVe vocabulary
  query_embed = embeddings_index[query]
  scores = {}
  for word, embed in data_embeddings.items():
    category = categories[word]
    dist = query_embed.dot(embed)
    dist /= len(data[category])
    scores[category] = scores.get(category, 0) + dist
  return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))
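
One caveat about the scoring in `process` above: a raw dot product rewards words whose embedding vectors happen to have large norms. If that bothers you, a common alternative is cosine similarity, which normalizes the norms away. A minimal sketch with made-up two-dimensional vectors (in the real script, `data`, `categories`, and `data_embeddings` come from the GloVe file as shown above):

```python
import numpy as np

# Tiny stand-in for the structures built from the GloVe file
data = {'Colors': ['red'], 'Places': ['tokyo']}
categories = {'red': 'Colors', 'tokyo': 'Places'}
data_embeddings = {
    'red':   np.array([3.0, 0.0]),   # deliberately large norm
    'tokyo': np.array([0.0, 1.0]),
}

def process_cosine(query_embed):
    scores = {}
    for word, embed in data_embeddings.items():
        category = categories[word]
        # Cosine similarity instead of a raw dot product
        sim = query_embed.dot(embed) / (
            np.linalg.norm(query_embed) * np.linalg.norm(embed))
        scores[category] = scores.get(category, 0.0) + sim / len(data[category])
    return scores

# Both scores come out ≈ 0.707: the norm advantage of 'red' is gone,
# whereas the raw dot product would have scored Colors 3x higher.
print(process_cosine(np.array([1.0, 1.0])))
```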

To run the script, you'll have to download and unpack the pre-trained GloVe data from here (careful, 800 MB!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}
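
To turn a scores dict like the ones above into a single predicted label, take the category with the highest score:

```python
# One of the scores dicts printed above (values rounded here)
scores = {'Colors': 24.66, 'Names': 5.06, 'Places': 0.90}

# max over the keys, ordered by their scores
predicted = max(scores, key=scores.get)
print(predicted)  # -> 'Colors'
```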

... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf scores. Keep in mind that the model size only depends on the data you have and the words you might want to be able to query.
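
A minimal sketch of that tf-idf filtering idea, using a tiny hypothetical corpus and a hand-rolled score (in practice you'd compute this over a large corpus from your own domain, and the exact weighting scheme is up to you):

```python
import math
from collections import Counter

# Hypothetical mini-corpus; substitute documents from your own domain
docs = [
    'red and yellow are colors',
    'tokyo and mumbai are places',
    'john went to tokyo',
]
tokenized = [d.split() for d in docs]

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))
n_docs = len(tokenized)

def tfidf(word):
    # Total term frequency across the corpus times inverse document frequency
    tf = sum(doc.count(word) for doc in tokenized)
    return tf * math.log(n_docs / df[word])

# Keep only the top-k scoring words; GloVe rows for everything else
# could then be dropped before loading the embedding matrix.
k = 5
vocab = sorted(df, key=tfidf, reverse=True)[:k]
print(vocab)
```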
