Wordpiece tokenization versus conventional lemmatization?

Problem description

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").
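
For concreteness, here is a minimal sketch of what I mean by "using the output from BERT" as a context-sensitive embedding. It assumes the Hugging Face transformers package; the bert-base-uncased checkpoint and the example sentence are just illustrative choices:

    import torch
    from transformers import BertModel, BertTokenizer

    # "bert-base-uncased" is just one example checkpoint.
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # WordPiece-tokenize the sentence and run it through the encoder.
    inputs = tokenizer("He plays bass in a jazz band", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One vector per WordPiece token, conditioned on the whole sentence,
    # which is what makes the embedding context-sensitive: "bass" here
    # gets a different vector than it would in "He caught a bass".
    print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])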

Right now, I have my text preprocessed using a standard tokenizer that splits on spaces and some punctuation, followed by a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?
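
Here is the comparison I have in mind, as a hedged sketch: pipeline A is roughly my current setup, pipeline B is WordPiece. It assumes nltk (with the wordnet data downloaded) and transformers are installed; the sentence and the checkpoint are arbitrary examples:

    import re

    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")
    from transformers import BertTokenizer

    text = "The children were playing outside"

    # Pipeline A: split on spaces/punctuation, then lemmatize each token.
    words = re.findall(r"[a-z]+", text.lower())
    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(w, pos="v") for w in words])
    # e.g. ['the', 'children', 'be', 'play', 'outside']

    # Pipeline B: WordPiece with a pretrained vocabulary
    # ("bert-base-uncased" is only an example; exact splits
    # depend on the learned vocabulary).
    wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")
    print(wordpiece.tokenize(text))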

Recommended answer

Word-piece tokenization helps in multiple ways and should generally work better than a lemmatizer, for several reasons:

  1. If words like 'playful', 'playing', and 'played' are all lemmatized to 'play', you lose information, such as the fact that 'playing' is present tense and 'played' is past tense; that loss does not happen with word-piece tokenization.
  2. Word-piece tokens cover every word, even words that do not occur in the dictionary: an unseen word is split into word pieces, so you still get embeddings for the split pieces instead of dropping the word or replacing it with an 'unknown' token (see the sketch after this list).
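
A small sketch of point 2, assuming the Hugging Face transformers package; the invented word and the bert-base-uncased vocabulary are only illustrative, and the exact pieces depend on how the vocabulary was trained:

    from transformers import BertTokenizer

    wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")

    # An invented word that no dictionary contains still decomposes into
    # known pieces instead of collapsing to a single [UNK] token, so every
    # piece keeps a trained embedding.
    print(wordpiece.tokenize("snorkelizing"))
    # e.g. ['sn', '##ork', '##eli', '##zing'] (vocabulary-dependent)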

Using word-piece tokenization instead of a tokenizer + lemmatizer is mostly a design choice, and word-piece tokenization should perform well. But you may have to take into account that word-piece tokenization increases the number of tokens, which is not the case with lemmatization.
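
To make the token-count point concrete, another hedged sketch (same illustrative checkpoint as above; the sentence is deliberately full of rare words):

    from transformers import BertTokenizer

    wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")

    sentence = "Electroencephalography fascinates neuroscientists"
    pieces = wordpiece.tokenize(sentence)

    # A tokenizer + lemmatizer keeps one token per word, while WordPiece
    # may emit several pieces per word, so sequence lengths grow.
    print(len(sentence.split()), "words ->", len(pieces), "word pieces")
    print(pieces)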
