How is WordPiece tokenization helpful to effectively deal with the rare words problem in NLP?


Question

I have seen that NLP models such as BERT utilize WordPiece for tokenization. In WordPiece, we split tokens like playing into play and ##ing. It is mentioned that this covers a wider spectrum of Out-Of-Vocabulary (OOV) words. Can someone please explain how WordPiece tokenization is actually done, and how it effectively helps with rare/OOV words?

Solution

WordPiece and BPE are two similar and commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of symbols in the vocabulary are iteratively added to it.

Consider the WordPiece algorithm from the original paper (wording slightly modified by me):

  1. Initialize the word unit inventory with all the characters in the text.
  2. Build a language model on the training data using the inventory from 1.
  3. Generate a new word unit by combining two units out of the current word inventory, incrementing the word unit inventory by one. Choose the new word unit out of all the possible ones that increases the likelihood on the training data the most when added to the model (a scoring sketch follows this list).
  4. Go to 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.
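Step 3 is what distinguishes WordPiece: the new unit is the pair whose merge most increases the training-data likelihood under the current language model. For a unigram model this amounts to preferring pairs that co-occur far more often than their parts would by chance. Below is a minimal sketch of that selection rule using the score count(xy) / (count(x) * count(y)), which is the heuristic used by, for example, the Hugging Face tokenizers WordPiece trainer; the function name is mine, and the pair/unit counts are assumed to have been collected already:

```python
from collections import Counter

def best_wordpiece_merge(pair_counts: Counter, unit_counts: Counter):
    """Pick the pair whose merge (approximately) raises a unigram
    language model's likelihood the most.

    Scoring count(xy) / (count(x) * count(y)) favors pairs whose
    parts rarely occur apart -- contrast with BPE, which ranks
    candidate merges by the raw pair count alone.
    """
    def score(pair):
        x, y = pair
        return pair_counts[pair] / (unit_counts[x] * unit_counts[y])

    return max(pair_counts, key=score)
```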

The BPE algorithm only differs in Step 3, where it simply chooses the new word unit as the combination of the next most frequently occurring pair among the current set of subword units.
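For concreteness, here is a minimal sketch of the BPE training loop in the style of Sennrich et al.'s reference implementation, run on the corpus from the example in the next section. Function names are mine; ties between equally frequent pairs are broken arbitrarily, so the exact merge order may differ from the hand-worked example below, but walk emerges either way:

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Rewrite every occurrence of 'x y' in the vocabulary as the merged unit 'xy'."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Step 1: every word starts as a space-separated sequence of characters.
corpus = "she walked . he is a dog walker . i walk".split()
vocab = Counter(" ".join(word) for word in corpus)

for _ in range(3):  # a fixed number of merge operations
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = apply_merge(best, vocab)
    print(best)
# Prints e.g. ('w', 'a'), ('wa', 'l'), ('wal', 'k') -- or the merges in the
# example below, depending on how ties are broken.
```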

Example

Input text: she walked . he is a dog walker . i walk

First 3 BPE Merges:

  1. w a = wa
  2. l k = lk
  3. wa lk = walk

So at this stage, your vocabulary includes all the initial characters, along with wa, lk, and walk. You usually do this for a fixed number of merge operations.

How does it handle rare/OOV words?

Quite simply, OOV words are impossible if you use such a segmentation method. Any word which does not occur in the vocabulary will be broken down into subword units. Similarly, for rare words, given that the number of subword merges we used is limited, the word will not occur in the vocabulary, so it will be split into more frequent subwords.
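To make that concrete, here is a hedged sketch of segmentation by replaying the learned merges in training order (the greedy scheme used by subword-nmt style BPE; the merge list is taken from the example above). Because the fallback is always single characters, every input word can be segmented:

```python
def segment(word, merges):
    """Split a word into subword units by replaying learned merges in order."""
    symbols = list(word)  # single characters are always in the vocabulary
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair in place
            else:
                i += 1
    return symbols

merges = [("w", "a"), ("l", "k"), ("wa", "lk")]  # from the example above
print(segment("walking", merges))  # ['walk', 'i', 'n', 'g'] -- never OOV
print(segment("walked", merges))   # ['walk', 'e', 'd']
```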

How does this help?

Imagine that the model sees the word walking. Unless this word occurs at least a few times in the training corpus, the model can't learn to deal with it very well. However, the corpus may contain the words walked, walker, and walks, each occurring only a few times. Without subword segmentation, all these words are treated as completely different words by the model.

However, if these get segmented as walk@@ ing, walk@@ ed, etc., notice that all of them will now have walk@@ in common, which will occur much more frequently during training, and the model might be able to learn more about it.
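You can observe this with a real WordPiece vocabulary by running BERT's tokenizer (this sketch assumes the transformers package is installed; the exact splits depend on the pretrained vocabulary, so treat the printed pieces as illustrative):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent surface forms may exist as whole vocabulary entries, while
# rarer or invented words fall back to '##'-prefixed subword pieces.
for word in ["walking", "walked", "walkability", "unwalkable"]:
    print(word, "->", tokenizer.tokenize(word))
```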
