How to train a word embedding representation with gensim fasttext wrapper?
Problem description
I would like to train my own word embeddings with fastText. However, after following the tutorial I cannot manage to do it properly. So far I have tried:
In:
from gensim.models.fasttext import FastText as FT_gensim
# Set file names for train and test data
corpus = df['sentences'].values.tolist()
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(sentences=corpus)
model_gensim
Out:
<gensim.models.fasttext.FastText at 0x7f6087cc70f0>
In:
# train the model
model_gensim.train(
    sentences=corpus,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count,
    total_words=model_gensim.corpus_total_words
)
print(model_gensim)
Out:
FastText(vocab=107, size=100, alpha=0.025)
However, when I try to look for a word in the vocabulary:
print('return' in model_gensim.wv.vocab)
I get False, even though the word is present in the sentences I am passing to the FastText model. Also, when I check the most similar words to "return", I get single characters:
model_gensim.most_similar("return")
[('R', 0.15871645510196686),
('2', 0.08545402437448502),
('i', 0.08142799884080887),
('b', 0.07969795912504196),
('a', 0.05666942521929741),
('w', 0.03705815598368645),
('c', 0.032348938286304474),
('y', 0.0319858118891716),
('o', 0.027745068073272705),
('p', 0.026891689747571945)]
What is the correct way of using gensim's fasttext wrapper?
Recommended answer
The gensim FastText class doesn't take plain strings as its training texts. It expects lists of words instead. If you pass plain strings, they will look like lists of single characters, and you'll get a stunted vocabulary like the one you're seeing.
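To see why, note that a plain string is itself an iterable, and iterating it yields single characters; a minimal sketch (with a made-up sentence) of what the model effectively receives:

```python
# Iterating a string yields single characters, so each "word"
# the model sees is one character of the original sentence
sentence = "return value"
print(list(sentence)[:6])
# ['r', 'e', 't', 'u', 'r', 'n']
```

This is exactly where the one-character vocabulary entries in the `most_similar` output come from.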
Tokenize each item of your corpus into a list of word tokens and you'll get closer-to-expected results. One super-simple way to do this might be:
corpus = [s.split() for s in corpus]
Usually, though, you'd want to do more to properly tokenize plain text – perhaps lower-case it, handle punctuation, and so on.