What exactly does the Keras Tokenizer method do?
Question
On occasion, circumstances require us to do the following:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=my_max)
Then, invariably, we chant this mantra:
tokenizer.fit_on_texts(text)
sequences = tokenizer.texts_to_sequences(text)
While I (more or less) understand what the total effect is, I can't figure out what each one does separately, regardless of how much research I do (including, obviously, the documentation). I don't think I've ever seen one without the other.
So what does each do? Are there any circumstances where you would use either one without the other? If not, why aren't they simply combined into something like:
sequences = tokenizer.fit_on_texts_to_sequences(text)
Apologies if I'm missing something obvious, but I'm pretty new at this.
Answer
From the source code:
fit_on_texts
Updates the internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. So if you give it something like "The cat sat on the mat.", it will create a dictionary such that word_index["the"] = 1; word_index["cat"] = 2. It is a word -> index dictionary, so every word gets a unique integer value. 0 is reserved for padding, and a lower integer means a more frequent word (often the first few are stop words, because they appear a lot).
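For instance, here is a minimal sketch of what fit_on_texts builds. The toy corpus is my own, and the exact indices can vary with the Keras version and Tokenizer settings, so treat the printed dictionary as illustrative:

from keras.preprocessing.text import Tokenizer

texts = ["The cat sat on the mat.", "The dog sat on the log."]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# The fitted vocabulary: lower index = more frequent word; 0 is never assigned.
print(tokenizer.word_index)
# e.g. {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, 'mat': 5, 'dog': 6, 'log': 7}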
texts_to_sequences
Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Nothing more, nothing less, certainly no magic involved.
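A matching sketch, using the same toy corpus as above (again, the indices are only illustrative):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["The cat sat on the mat.", "The dog sat on the log."])

# Each word is looked up in word_index; words the tokenizer has never
# seen are silently dropped (unless oov_token was set on the Tokenizer).
print(tokenizer.texts_to_sequences(["The cat sat on the log."]))
# e.g. [[1, 4, 2, 3, 1, 7]]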
Why not combine them? Because you almost always fit once and convert to sequences many times. You will fit on your training corpus once and use that exact same word_index dictionary at train / eval / test / prediction time to convert actual text into sequences to feed them to the network. So it makes sense to keep those methods separate.
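In code, the usual pattern looks something like the sketch below. Here train_texts and test_texts are hypothetical placeholders, and num_words=1000 is an arbitrary vocabulary cap standing in for the my_max in the question:

from keras.preprocessing.text import Tokenizer

train_texts = ["The cat sat on the mat.", "The dog sat on the log."]
test_texts = ["The dog sat on the mat."]   # hypothetical held-out data

tokenizer = Tokenizer(num_words=1000)  # keep only the most frequent words
tokenizer.fit_on_texts(train_texts)    # fit ONCE, on the training corpus only

# Convert as many times as you like; train and test share one word_index.
x_train = tokenizer.texts_to_sequences(train_texts)
x_test = tokenizer.texts_to_sequences(test_texts)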