What is the meaning of 'cut-off' and 'iteration' for trainings in OpenNLP?


Problem Description

What is the meaning of cut-off and iteration for training in OpenNLP, or, for that matter, in natural language processing in general? I just need a layman's explanation of these terms. As I understand it, iteration is the number of times the algorithm is repeated, and cut-off is a value such that if a text scores above this cut-off for some specific category, it will get mapped to that category. Am I right?

Recommended Answer

Correct, the term iteration refers to the general notion of iterative algorithms, where one sets out to solve a problem by successively producing (hopefully increasingly accurate) approximations of some "ideal" solution. Generally speaking, the more iterations, the more accurate ("better") the result will be, but of course more computational steps have to be carried out.
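As a concrete illustration (a minimal sketch, assuming a recent OpenNLP release on the classpath; check your version's Javadoc for opennlp.tools.util.TrainingParameters), both values discussed in this question are ordinary training parameters and are usually passed to a trainer like this:

import java.util.Map;

import opennlp.tools.util.TrainingParameters;

public class TrainingConfigSketch {
    public static void main(String[] args) {
        TrainingParameters params = new TrainingParameters();
        // More iterations: more passes of the iterative optimizer over the
        // training data -- usually a closer fit, but more computation.
        params.put(TrainingParameters.ITERATIONS_PARAM, "300");
        // The cutoff (explained below) drops rarely seen features/n-grams
        // from the model, which keeps it small.
        params.put(TrainingParameters.CUTOFF_PARAM, "2");
        // Inspect the configured settings; the keys are OpenNLP's parameter names.
        Map<String, String> settings = params.getSettings();
        System.out.println(settings);
    }
}

The resulting parameters object is then handed to the train(...) call of whichever component is being built (POS tagger, document categorizer, and so on).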

The term cutoff (aka cutoff frequency) is used to designate a method of reducing the size of n-gram language models (as used by OpenNLP, e.g. in its part-of-speech tagger). Consider the following example:

Sentence 1 = "The cat likes mice."
Sentence 2 = "The cat likes fish."
Bigram model = {"the cat" : 2, "cat likes" : 2, "likes mice" : 1, "likes fish" : 1}

If you set the cutoff frequency to 1 for this example, the n-gram model would be reduced to

Bigram model = {"the cat" : 2, "cat likes" : 2}

That is, the cutoff method removes from the language model those n-grams that occur infrequently in the training data. Reducing the size of n-gram language models is sometimes necessary, as the number of even bigrams (let alone trigrams, 4-grams, etc.) explodes for larger corpora. The remaining information (the n-gram counts) can then be used to statistically estimate the probability of a word (or its POS tag) given the (n-1) previous words (or POS tags).
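To make the arithmetic of the example explicit, here is a small self-contained Java sketch (plain Java, not OpenNLP code) that counts the bigrams of the two sentences and then drops every bigram whose count does not exceed the cutoff, following the convention used in the example above:

import java.util.LinkedHashMap;
import java.util.Map;

public class BigramCutoffSketch {
    public static void main(String[] args) {
        // Count bigrams over the two example sentences (lower-cased, punctuation stripped).
        String[] sentences = {"The cat likes mice.", "The cat likes fish."};
        Map<String, Integer> bigramCounts = new LinkedHashMap<>();
        for (String s : sentences) {
            String[] tokens = s.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i++) {
                bigramCounts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
        System.out.println("Full model:   " + bigramCounts);
        // Apply the cutoff as in the example above: bigrams whose count does not
        // exceed the cutoff are removed from the model.
        int cutoff = 1;
        bigramCounts.values().removeIf(count -> count <= cutoff);
        System.out.println("After cutoff: " + bigramCounts);
    }
}

Running it prints the full four-entry model followed by the reduced model containing only "the cat" and "cat likes", as above. The surviving counts are what feed the statistical estimates mentioned in the last paragraph, e.g. the maximum-likelihood estimate P("cat" | "the") = count("the cat") / count("the") = 2/2 = 1 in this toy corpus.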
