How to train a word embedding representation with the gensim fastText wrapper?


Question

I would like to train my own word embeddings with fastText. However, after following the tutorial I cannot get it to work properly. So far I have tried:

In:

from gensim.models.fasttext import FastText as FT_gensim

# Set file names for train and test data
corpus = df['sentences'].values.tolist()

model_gensim = FT_gensim(size=100)

# build the vocabulary
model_gensim.build_vocab(sentences=corpus)
model_gensim

Out:

<gensim.models.fasttext.FastText at 0x7f6087cc70f0>

In:

# train the model
model_gensim.train(
    sentences = corpus, 
    epochs = model_gensim.epochs,
    total_examples = model_gensim.corpus_count, 
    total_words = model_gensim.corpus_total_words
)

print(model_gensim)

Out:

FastText(vocab=107, size=100, alpha=0.025)

However, when I check whether a word is in the vocabulary:

print('return' in model_gensim.wv.vocab)

I get False, even though the word is present in the sentences I am passing to the fastText model. Also, when I check the most similar words to "return", I get single characters:

model_gensim.most_similar("return")

[('R', 0.15871645510196686),
 ('2', 0.08545402437448502),
 ('i', 0.08142799884080887),
 ('b', 0.07969795912504196),
 ('a', 0.05666942521929741),
 ('w', 0.03705815598368645),
 ('c', 0.032348938286304474),
 ('y', 0.0319858118891716),
 ('o', 0.027745068073272705),
 ('p', 0.026891689747571945)]

What is the correct way to use gensim's fastText wrapper?

Recommended answer

The gensim FastText class doesn't take plain strings as its training texts. It expects lists of words instead. If you pass plain strings, each one is treated as a list of single characters, and you get a stunted vocabulary like the one you're seeing (107 entries, all of them individual characters).
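The effect can be reproduced without gensim at all. The snippet below is only an illustration in plain Python: `count_tokens` is a hypothetical helper, not gensim's vocabulary code, and the two sample sentences are made up. It shows what any consumer that loops over each "sentence" will see when the sentence is a string rather than a list of words.

```python
from collections import Counter

def count_tokens(sentences):
    """Count tokens by iterating over each sentence (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        # Iterating a str yields single characters;
        # iterating a list of words yields the words.
        for token in sentence:
            counts[token] += 1
    return counts

raw = ["the return was high", "return policy"]  # made-up sample data

as_strings = count_tokens(raw)
as_token_lists = count_tokens([s.split() for s in raw])

print("return" in as_strings)      # False: only characters were counted
print("return" in as_token_lists)  # True
```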

Tokenize each item of your corpus into a list of word tokens and you'll get closer-to-expected results. One super-simple way to do this might just be:

corpus = [s.split() for s in corpus]

But usually you'd want to do other things to properly tokenize plain text as well: perhaps case-flatten, or do something else with punctuation, etc.
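As a rough sketch of what that extra preprocessing might look like, the following lowercases each sentence and keeps only alphanumeric runs (with an optional internal apostrophe). The `tokenize` function, the regular expression, and the sample sentences are assumptions for illustration, not part of gensim; which normalization is right depends on your data.

```python
import re

# Assumed token rule: lowercase letters/digits, optionally joined by one apostrophe.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")

def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

corpus = ["The return was high.", "Return policy: 30 days!"]  # made-up sample data
tokenized = [tokenize(s) for s in corpus]

print(tokenized)
# [['the', 'return', 'was', 'high'], ['return', 'policy', '30', 'days']]
```

A list like `tokenized` is the shape that `build_vocab` and `train` expect in place of the raw strings. gensim also ships `gensim.utils.simple_preprocess`, which may be enough if its defaults suit your text.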
