BPE: multiple ways to encode a word


Question

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "ke", "en"). Then the word "token" could be encoded as ("to", "ke", "n") or ("to", "k", "en"). Such ambiguous encodings are also mentioned in this tutorial https://blog.floydhub.com/tokenization-nlp/
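
To make the ambiguity concrete, here is a minimal sketch that enumerates every way to split a word into tokens from a given vocabulary. The function name `segmentations` is hypothetical and is not part of any tokenizer library; the vocabulary is the one assumed in the question (all lowercase letters plus "to", "ke", "en").

```python
from string import ascii_lowercase


def segmentations(word, vocab):
    """Return every way to split `word` into tokens drawn from `vocab`."""
    if not word:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocab:
            # Keep this prefix as a token and segment the remainder.
            for rest in segmentations(word[end:], vocab):
                results.append([prefix] + rest)
    return results


# Vocabulary from the question: single letters plus three merged symbols.
vocab = set(ascii_lowercase) | {"to", "ke", "en"}

for seg in segmentations("token", vocab):
    print(seg)
```

For "token" this lists six candidate segmentations, including both ("to", "ke", "n") and ("to", "k", "en"), so the vocabulary alone does not pin down a unique encoding.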

However, the Hugging Face tutorial states that "BPE and WordPiece [...] work out rules in a certain order that you can then apply in the same order when tokenizing new text", see https://huggingface.co/transformers/master/tokenizer_summary.html.

How exactly are these rules stored and applied when using BPE/WordPiece, e.g., in my example above, how is it determined which tokenization to use?

Recommended answer

In the parsing step of BPE, the merging order matters. For instance, if the merging order is

(p, e), (pe, n), (pen, _), (a, p), (ap, p), (app, l), (appl, e), (apple, _), (pen, apple_)

then with k = 2 only the first two merges, (p, e) and (pe, n), are used for parsing, so Applepen PenapplePen (lowercased, and leaving aside the end-of-word marker _) is segmented into [a, p, p, l, e, pen, pen, a, p, p, l, e, pen]. Because the merge order is fixed, the segmentation of any test data is deterministic for a given k: you simply apply the first k merges, in order, in the parsing step.
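
The parsing step described above can be sketched as follows. This is an illustration under stated assumptions, not the code of any particular library: `bpe_segment` is a hypothetical name, the input is lowercased and split on whitespace, and the end-of-word marker `_` is not appended to the words.

```python
def bpe_segment(word, merges, k):
    """Split `word` into characters, then apply the first k merges in order."""
    symbols = list(word)
    for left, right in merges[:k]:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge every adjacent (left, right) pair, scanning left to right.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Merge list from the answer, in learned order.
merges = [("p", "e"), ("pe", "n"), ("pen", "_"), ("a", "p"), ("ap", "p"),
          ("app", "l"), ("appl", "e"), ("apple", "_"), ("pen", "apple_")]

text = "Applepen PenapplePen"
tokens = [tok for word in text.lower().split()
          for tok in bpe_segment(word, merges, k=2)]
print(tokens)

# The example from the question: the merge order decides the encoding.
print(bpe_segment("token", [("t", "o"), ("k", "e"), ("e", "n")], k=3))  # ['to', 'ke', 'n']
print(bpe_segment("token", [("t", "o"), ("e", "n"), ("k", "e")], k=3))  # ['to', 'k', 'en']
```

The last two calls show how the ambiguity in the question is resolved: whichever of (k, e) and (e, n) comes first in the merge list consumes the shared "e", so the later rule no longer matches.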

For the details please refer to my answer to the question: Explain bpe (Byte Pair Encoding) with examples?
