Difference between max length of word ngrams and size of context window

Question

In the description of the fastText library for Python (https://github.com/facebookresearch/fastText/tree/master/python), there are several arguments for training a supervised model, among which are:

  • ws: size of the context window
  • wordNgrams: max length of word ngram

If I understand it right, both of them are responsible for taking into account the surrounding words of a word, but what is the clear difference between them?
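For context, a minimal sketch of how these two arguments are passed to train_supervised (the file name train.txt is an assumption; each line of the training file is expected to follow fastText's __label__<label> <text> format):

```python
import fasttext

# Hedged sketch: 'train.txt' is an assumed file name; each line should
# look like "__label__positive some text here".
model = fasttext.train_supervised(
    input="train.txt",
    ws=5,          # size of the context window
    wordNgrams=2,  # max length of word ngram
)

print(model.predict("the quick brown fox jumps over the lazy dog"))
```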

Answer

First, we use the train_unsupervised API to create a word-representation model. There are two techniques that we can use: skipgram and cbow. On the other hand, we use the train_supervised API to create a text-classification model. You are asking about the train_supervised API, so I will stick to it.
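A minimal sketch of the two APIs, assuming files corpus.txt (raw text) and train.txt (labeled text) exist:

```python
import fasttext

# Word-representation models: skipgram or cbow ('corpus.txt' is an assumption).
sg_model = fasttext.train_unsupervised("corpus.txt", model="skipgram")
cbow_model = fasttext.train_unsupervised("corpus.txt", model="cbow")

# Text-classification model: expects "__label__<label> <text>" lines.
clf_model = fasttext.train_supervised("train.txt")
```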

The way text classification works in fastText is to first represent the words using skipgram by default. Then, these word vectors learned from the skipgram model are used to classify your input text. The two parameters you asked about (ws and wordNgrams) are related to the skipgram/cbow model.

The following image contains a simplified illustration of how we use our input text to train the skipgram model. Here, we defined the ws parameter as 2 and wordNgrams as 1.

As we can see, we have only one text in our training data, which is The quick brown fox jumps over the lazy dog. We defined the context window to be two, which means that we will create a window whose center is the center word, and the two words before/after it within the window are the target words. Then, we move this window one word at a time. The bigger the window size, the more training samples you have for your model, and the more the model overfits given a small sample of data.
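To make the windowing concrete, here is a pure-Python sketch (not fastText's actual code) of the (center word, target word) pairs that a context window of ws = 2 produces for that sentence:

```python
# Sliding a context window of size ws over the sentence, one word at a time.
sentence = "The quick brown fox jumps over the lazy dog".split()
ws = 2

pairs = []
for i, center in enumerate(sentence):
    # the ws words before and after the center word are the target words
    for j in range(max(0, i - ws), min(len(sentence), i + ws + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('The', 'quick'), ('The', 'brown'), ('quick', 'The'), ('quick', 'brown')]
```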

That's for our first argument, ws. As for the second argument, wordNgrams: if we set wordNgrams to 2, it will consider two-word pairs like the following image. (The ws in the following image is one, for simplicity.)
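To make this concrete as well, a pure-Python sketch (not fastText's actual code) of the word 2-grams that wordNgrams = 2 adds on top of the individual words:

```python
# Enumerating the word bigrams of the example sentence.
tokens = "The quick brown fox jumps over the lazy dog".split()
word_ngrams = 2

bigrams = [" ".join(tokens[i:i + word_ngrams])
           for i in range(len(tokens) - word_ngrams + 1)]
print(bigrams)
# ['The quick', 'quick brown', 'brown fox', 'fox jumps',
#  'jumps over', 'over the', 'the lazy', 'lazy dog']
```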

  • Check this link which contains the source code for the train_supervised method.

There is a major difference between skipgram and cbow that can be summarized in the following image:
