How to handle <unk> tokens in text generation


Question

In my text generation dataset, I have converted all infrequent words into the <unk> (unknown word) token, as most of the text-generation literature suggests.

However, when training an RNN to take part of a sentence as input and predict the rest of the sentence, I am not sure how to stop the network from generating <unk> tokens. When the network encounters an unknown (infrequent) word in the training set, what should its output be?

Example:
Sentence: I went to the mall and bought a <unk> and some groceries
Network input: I went to the mall and bought a
Current network output: <unk> and some groceries
Desired network output: ??? and some groceries

What should it output instead of <unk>?

I don't want to build a generator that outputs words it does not know.

Recommended answer

An RNN gives you a probability for each token that could appear next in your text. In your code you choose the token with the highest probability, which in this case is <unk>.

In this case you can skip the <unk> token and simply take the next most likely token, based on the probability values the RNN produces.
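
The idea of skipping <unk> at generation time can be sketched as follows. This is a minimal illustration, not the asker's actual code: the function name next_token, the banned parameter, and the toy vocabulary and scores are all assumptions made for the example. It masks the <unk> score before choosing, so both greedy selection and sampling fall back to the remaining tokens.

```python
import math
import random


def next_token(logits, vocab, banned=("<unk>",), sample=False, rng=random):
    """Pick the next token from raw scores without ever emitting a banned token.

    logits: one score per vocabulary entry (hypothetical RNN output).
    vocab:  list of token strings aligned with logits.
    """
    # Mask banned tokens by giving them a score of -inf.
    masked = [
        float("-inf") if tok in banned else score
        for tok, score in zip(vocab, logits)
    ]
    top = max(masked)
    if top == float("-inf"):
        raise ValueError("every token in the vocabulary is banned")

    if not sample:
        # Greedy decoding: the highest-scoring token that is not banned.
        return vocab[masked.index(top)]

    # Sampling: softmax weights over the masked scores; exp(-inf) is 0.0,
    # so banned tokens can never be drawn.
    weights = [math.exp(s - top) for s in masked]
    return rng.choices(vocab, weights=weights, k=1)[0]


# Toy example (made-up numbers): <unk> has the highest raw score.
vocab = ["<unk>", "book", "shirt", "and"]
logits = [3.2, 2.1, 1.7, 0.3]
print(next_token(logits, vocab))               # greedy choice, <unk> excluded
print(next_token(logits, vocab, sample=True))  # sampled choice, <unk> excluded
```

Masking only at inference time leaves training unchanged: the network still learns that a rare word occurs in that position, but the generator emits the best known word instead.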
