word2vec gensim multiple languages

Question

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:

model.wv.most_similar(positive = ['man'])
Out[14]: 
[('woman', 0.7380284070968628),
 ('lady', 0.6933152675628662),
 ('monk', 0.6662989258766174),
 ('guy', 0.6513140201568604),
 ('soldier', 0.6491742134094238),
 ('priest', 0.6440571546554565),
 ('farmer', 0.6366012692451477),
 ('sailor', 0.6297377943992615),
 ('knight', 0.6290514469146729),
 ('person', 0.6288090944290161)]

Problem is, these are all English words. Then I tried to find the similarity between Hindi and English words with the same meaning:

model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will 
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215

This similarity should have been higher than all the other scores above. The Hindi corpus I have was made by translating the English one, hence the words appear in similar contexts and should be close.

This is what I am doing here:

import multiprocessing
from gensim.models import Word2Vec

# Combining the Hindi and English sentences into one corpus.
all_reviews = HindiWordsList + EnglishWordsList

# Training a Word2Vec model (gensim 3.x API: size/iter rather than vector_size/epochs).
cpu_count = multiprocessing.cpu_count()
model = Word2Vec(size=300, window=5, min_count=1, alpha=0.025,
                 workers=cpu_count, max_vocab_size=None, negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews, total_examples=model.corpus_count, epochs=model.iter)
model.save("word2vec_combined_50.bin")

Answer

First of all, you should really use self.wv.similarity().
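
For example, the deprecated call from the question can be rewritten to go through the wv attribute; a minimal sketch, assuming the model saved in the training snippet above is reloaded:

from gensim.models import Word2Vec

# Reload the model saved in the question's training code.
model = Word2Vec.load("word2vec_combined_50.bin")

# Non-deprecated equivalents of the calls used in the question.
print(model.wv.similarity('man', 'आदमी'))
print(model.wv.most_similar(positive=['man']))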

I'm assuming there are close to no words shared between your Hindi corpus and your English corpus, since the Hindi corpus is in Devanagari and the English one is, well, in English. Simply adding the two corpora together to build one model does not make sense: corresponding words in the two languages co-occur across the two versions of a document, but never within the same sentences, so the word embeddings give Word2Vec nothing to work with when figuring out which words are most similar.

E.g. until your model knows that

Man:Aadmi::Woman:Aurat,

from the word embeddings, it can never make out the

Raja:King::Rani:Queen

relation. And for that, you need some anchor between the two corpora. Here are a few suggestions that you can try out:

  1. Build a separate Hindi corpus/model.
  2. Manually create and maintain some data of English -> Hindi word pairs and look words up in it.
  3. During training, randomly replace words in an input document with the corresponding words from the translated version of that document (see the sketch after this list).
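
A minimal sketch of the third suggestion, assuming a small hand-made English -> Hindi anchor dictionary (en_hi, a hypothetical name, essentially what suggestion 2 would produce) and the tokenized all_reviews list from the question:

import random

# Hypothetical hand-made anchor pairs (suggestion 2); extend with as many pairs as you can maintain.
en_hi = {'man': 'आदमी', 'woman': 'औरत', 'king': 'राजा', 'queen': 'रानी'}
hi_en = {hindi: english for english, hindi in en_hi.items()}

def randomly_swap(sentence, p=0.3):
    # With probability p, replace an anchor word with its translation so that
    # English and Hindi words start to appear in shared contexts during training.
    swapped = []
    for word in sentence:
        if word in en_hi and random.random() < p:
            swapped.append(en_hi[word])
        elif word in hi_en and random.random() < p:
            swapped.append(hi_en[word])
        else:
            swapped.append(word)
    return swapped

# Apply to the combined corpus from the question before build_vocab()/train().
all_reviews = [randomly_swap(sentence) for sentence in all_reviews]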

These might be enough to give you an idea. You can also look into seq2seq if you only want to do translations. You can also read up on the theory behind Word2Vec in detail to understand what it does.
