Converting string tokens into integers


Question

I am trying to convert the tokens of a sentence into integers, but I am getting floats instead.

from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

sometext = "hello how are you doing?"

tokens = word_tokenize(sometext)
model = Word2Vec([tokens], min_count=1, size=1)

When I run:

print(model["hello"])

it gives me:

[-0.3843384]

I want this to be a positive integer.

Recommended answer

There's no essential reason to use Word2Vec for this. The point of Word2Vec is to map words to multi-dimensional, "dense" vectors with many floating-point coordinates.

Though Word2Vec happens to scan your training corpus for all unique words, and gives each unique word an integer position in its internal data structures, you wouldn't usually make a model of only one dimension (size=1), or ask the model for a word's integer slot (an internal implementation detail).

If you just need a (string word) -> (int id) mapping, the gensim class Dictionary can do that. See:

https://radimrehurek.com/gensim/corpora/dictionary.html

from nltk.tokenize import word_tokenize
from gensim.corpora.dictionary import Dictionary

sometext = "hello how are you doing?"

tokens = word_tokenize(sometext)
my_vocab = Dictionary([tokens])

print(my_vocab.token2id['hello'])
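
Note that Dictionary ids are zero-based, so some token will get id 0, which is not a positive integer. If strictly positive ids are required, the same kind of mapping can be built by hand. The following is only a sketch in plain Python, not gensim's implementation: `build_vocab` and its `start_id` parameter are hypothetical names, and the regular expression is a rough stand-in for `word_tokenize`.

```python
import re


def build_vocab(sentences, start_id=1):
    """Assign each distinct token an integer id, in first-seen order.

    Hypothetical helper for illustration; ids begin at start_id.
    """
    token2id = {}
    for tokens in sentences:
        for token in tokens:
            if token not in token2id:
                token2id[token] = start_id + len(token2id)
    return token2id


sometext = "hello how are you doing?"

# Rough stand-in for word_tokenize: runs of word characters,
# plus each punctuation mark as its own token.
tokens = re.findall(r"\w+|[^\w\s]", sometext)

token2id = build_vocab([tokens])
ids = [token2id[t] for t in tokens]

print(token2id["hello"])  # 1
print(ids)                # [1, 2, 3, 4, 5, 6]
```

Ids here follow the order in which tokens first appear, so they are stable as long as the corpus is read in the same order.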

Now, if there is actually some valid reason to be using Word2Vec (such as needing multidimensional vectors for a larger vocabulary, trained on a significant amount of varying text) and your real need is to know its internal integer slots for words, you can access those via the vocab dictionary of the model's wv property:

print(model.wv.vocab['hello'].index)
