Difference between max length of word ngrams and size of context window
Question
In the description of the fasttext library for Python (https://github.com/facebookresearch/fastText/tree/master/python), several arguments for training a supervised model are listed, among them:
- ws: size of the context window
- wordNgrams: max length of word ngram
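For context, here is a minimal sketch of how these two arguments are passed to train_supervised; the input file name and the specific values are placeholders, not from the original question:

```python
import fasttext

# "train.txt" is a hypothetical file in fastText's supervised format,
# one example per line: "__label__<class> <text>".
model = fasttext.train_supervised(
    input="train.txt",
    ws=5,          # size of the context window
    wordNgrams=2,  # max length of word ngram
)
```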
If I understand it right, both of them are responsible for taking the surrounding words of a word into account, but what is the clear difference between them?
Answer
First, we use the train_unsupervised API to create a word-representation model. There are two techniques that we can use, skipgram and cbow. On the other hand, we use the train_supervised API to create a text classification model. You are asking about the train_supervised API, so I will stick to it.
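For reference, a minimal sketch of the unsupervised entry point (the corpus file name is a placeholder; train_supervised is sketched above):

```python
import fasttext

# Unsupervised word representations; the model argument selects the
# technique: "skipgram" (the default) or "cbow".
wv_model = fasttext.train_unsupervised("corpus.txt", model="skipgram")
```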
The way text classification works in fasttext is to first represent the words, using skipgram by default, and then use these word vectors learned from the skipgram model to classify your input text. The two parameters that you asked about (ws and wordNgrams) are related to the skipgram/cbow model.
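A short, self-contained sketch of that pipeline (the file name and input text are placeholders):

```python
import fasttext

# "labeled.txt" is a hypothetical file of "__label__<class> <text>" lines.
model = fasttext.train_supervised("labeled.txt")

# The classifier exposes the word vectors it learned internally...
vec = model.get_word_vector("fox")

# ...and uses them to classify new input text.
labels, probs = model.predict("the quick brown fox")
```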
The following image contains a simplified illustration of how we use our input text to train the skipgram model. Here, we set the ws parameter to 2 and wordNgrams to 1.
As we can see, we have only one text in our training data, which is The quick brown fox jumps over the lazy dog. We defined the context window to be two, which means that we create a window whose center is the center word and whose next/previous two words are the target words. Then, we move this window one word at a time. The bigger the window size, the more training samples you have for your model, and the more overfitted the model becomes given a small sample of data.
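A plain-Python sketch of the (center word, target words) pairs this sliding window produces for ws=2; this mirrors the illustration, not fasttext's actual implementation:

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
ws = 2  # context window size

for i, center in enumerate(sentence):
    # Target words: up to `ws` words before and after the center word.
    targets = sentence[max(0, i - ws):i] + sentence[i + 1:i + 1 + ws]
    print(center, "->", targets)
```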
That's for our first argument, ws. As for the second argument, wordNgrams: if we set wordNgrams to 2, it will consider two-word pairs like in the following image. (The ws in the following image is one, for simplicity.)
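A quick sketch of what those two-word ngrams look like for the same example sentence:

```python
sentence = "The quick brown fox jumps over the lazy dog".split()

# With wordNgrams=2, consecutive word pairs are added as extra features.
bigrams = [" ".join(pair) for pair in zip(sentence, sentence[1:])]
print(bigrams)  # ['The quick', 'quick brown', 'brown fox', ...]
```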
Check this link, which contains the source code for the train_supervised method.
There is a major difference between skipgram and cbow that can be summarized in the following image:
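In code terms, the difference in the training pairs can be sketched like this (a toy illustration under the same windowing assumptions, not fasttext's implementation):

```python
sentence = "The quick brown fox jumps".split()
ws = 1  # window of one word on each side, for brevity

for i, center in enumerate(sentence):
    context = sentence[max(0, i - ws):i] + sentence[i + 1:i + 1 + ws]
    for word in context:
        print("skipgram:", center, "->", word)  # center predicts each context word
    print("cbow:    ", context, "->", center)   # context jointly predicts the center
```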