Pretraining a language model on a small custom corpus


Problem description

I was curious whether it is possible to use transfer learning for text generation, and to re-train/pre-train a model on a specific kind of text.

For example, given a pre-trained BERT model and a small corpus of medical (or any other "type" of) text, can one build a language model that is able to generate medical text? The assumption is that you do not have a huge amount of "medical texts", which is why you have to use transfer learning.

Putting it as a pipeline, I would describe this as:


  1. Using a pre-trained BERT tokenizer.
  2. Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
  3. Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
  4. Generating text that resembles the text within the small custom corpus.

Does this sound familiar?
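In code, I imagine steps 1–2 would look roughly like the following untested sketch with the huggingface transformers library (the new-token list and checkpoint name are just placeholders I picked for illustration):

```python
# Untested sketch of pipeline steps 1-2: extend a pre-trained BERT tokenizer
# with domain-specific tokens and resize the model's embedding matrix so the
# new tokens get (randomly initialized) embeddings before re-training.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder tokens extracted from my small medical corpus.
new_tokens = ["pneumothorax", "tachycardia", "angioplasty"]
num_added = tokenizer.add_tokens(new_tokens)

# The embedding table must grow to cover the newly added vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

# Step 3 would then continue masked-language-model training on the custom
# corpus with this tokenizer/model pair.
```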

Recommended answer

I have not heard of the pipeline you just mentioned. In order to construct an LM for your use case, you basically have two options:


  1. Further training the BERT (-base/-large) model on your own corpus. This process is also called domain adaptation, as described in this recent paper. It will adapt the learned parameters of the BERT model to your specific domain (bio/medical text). Nonetheless, for this setting you will need quite a large corpus to help the BERT model update its parameters well.

  2. Using a pre-trained language model that was pre-trained on a large amount of domain-specific text, either from scratch or fine-tuned from the vanilla BERT model. As you might know, the vanilla BERT model released by Google was trained on Wikipedia text. After vanilla BERT, researchers tried to train the BERT architecture on other domains besides Wikipedia. You may be able to use these pre-trained models, which have a deep understanding of domain-specific language. For your case, there are models such as BioBERT, BlueBERT, and SciBERT; a loading sketch follows below.
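For example, loading one of these domain-specific checkpoints with huggingface transformers could look like the sketch below; the hub model IDs are the ones I believe BioBERT and SciBERT are published under, so double-check them before relying on this:

```python
# Untested sketch: load a domain-adapted BERT checkpoint instead of vanilla BERT.
# The model IDs below are the hub names I believe BioBERT / SciBERT use;
# please verify them on the huggingface model hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "dmis-lab/biobert-base-cased-v1.1"   # or "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Quick sanity check: fill in a masked token in a medical sentence.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("The patient was treated with [MASK] for the infection."))
```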

Is it possible with hugging-face?

I am not sure whether the huggingface developers have developed a robust approach for pre-training the BERT model on custom corpora, as their code is claimed to still be in progress, but if you are interested in doing this step, I suggest using the Google research bert code, which is written in TensorFlow and is totally robust (released by BERT's authors). The exact procedure is described in their README under the "Pre-training with BERT" section.
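That said, if you still want to prototype the re-training step on the huggingface side (keeping the caveat above in mind), a rough, untested sketch of continuing masked-language-model training on a custom corpus could look like this; the corpus file name and hyperparameters are placeholders:

```python
# Untested sketch: continue masked-language-model (MLM) pre-training of BERT
# on a small custom corpus with huggingface transformers + datasets.
# "medical_corpus.txt" and all hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One document per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "medical_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Dynamically mask 15% of tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-medical", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```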

