How are the TokenEmbeddings in BERT created?

Question

In the paper describing BERT, there is this paragraph about WordPiece Embeddings.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the i-th input token as T_i ∈ R^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

As I understand it, WordPiece splits words into word pieces like #I #like #swim #ing, but it does not generate embeddings. However, I did not find anything in the paper or in other sources about how those token embeddings are generated. Are they pretrained before the actual pre-training? How? Or are they randomly initialized?

Answer

The word pieces are trained separately, such that the most frequent words remain whole while the less frequent words eventually get split down into characters.
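To make this concrete, here is a minimal sketch assuming the Hugging Face transformers package and its pretrained bert-base-uncased vocabulary (neither is part of the original answer). It shows a trained WordPiece vocabulary keeping frequent words whole and splitting a rarer word into '##'-prefixed pieces:

```python
# Minimal sketch, assuming the Hugging Face "transformers" package is installed.
# It loads the WordPiece vocabulary shipped with bert-base-uncased and tokenizes
# two sentences to contrast whole words with sub-word pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("I like swimming"))     # frequent words usually stay whole
print(tokenizer.tokenize("I like snorkelling"))  # a rarer word is split into '##'-prefixed pieces
```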

The embeddings are trained jointly with the rest of BERT. Back-propagation runs through all the layers down to the embeddings, which get updated just like any other parameters in the network.
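As an illustration of that joint training, here is a minimal PyTorch sketch (an assumption for illustration, not the actual BERT implementation) in which the token, segment, and position embedding tables are ordinary, randomly initialized nn.Embedding layers that receive gradients through whatever is computed on top of them:

```python
# Minimal PyTorch sketch (not the real BERT code): the input embeddings are just
# learnable parameters, randomly initialized and updated by back-propagation.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)   # randomly initialized WordPiece embeddings
segment_emb = nn.Embedding(2, hidden)          # sentence A / sentence B embeddings
position_emb = nn.Embedding(max_len, hidden)   # learned position embeddings

input_ids = torch.tensor([[101, 2023, 2003, 102]])        # hypothetical token ids
segment_ids = torch.zeros_like(input_ids)                  # all sentence A here
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# Input representation = token + segment + position embeddings, as in the paper.
x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(positions)

# Any loss computed on top of x back-propagates into all three embedding tables.
loss = x.sum()
loss.backward()
print(token_emb.weight.grad is not None)  # True: embeddings are updated like any other parameter
```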

Note that only the embeddings of tokens that are actually present in the training batch get updated; the rest remain unchanged. This is also a reason why you need a relatively small word-piece vocabulary, so that all embeddings get updated frequently enough during training.
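This sparse-update behaviour is easy to verify with a toy example. The sketch below (hypothetical sizes and token ids, chosen only for illustration) checks which rows of the embedding table receive a non-zero gradient:

```python
# Small sketch: only the rows of the embedding table whose tokens appear in the
# batch receive gradient; all other rows are left unchanged by the update.
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)             # tiny vocabulary of 10 tokens
batch = torch.tensor([[1, 3, 3, 7]])  # only token ids 1, 3 and 7 appear

loss = emb(batch).sum()
loss.backward()

updated_rows = (emb.weight.grad.abs().sum(dim=1) != 0).nonzero(as_tuple=True)[0]
print(updated_rows.tolist())          # [1, 3, 7] -- the remaining rows have zero gradient
```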
