What is currently the best way to add a custom dictionary to a neural machine translator that uses the transformer architecture?


Question

It's common to add a custom dictionary to a machine translator to ensure that terminology from a specific domain is translated correctly. For example, the term "server" should be translated differently when the document is about data centers than when it is about restaurants.

With a transformer model, it is not obvious how to do this, since source and target words are not aligned 1:1. I've seen a couple of papers on this topic, but I'm not sure which would be the best one to use. What are the best practices for this problem?

Recommended answer

I am afraid you cannot easily do that. You cannot simply add new words to the vocabulary, because the model would have no trained embedding for them. You can try to remove some words, or alternatively you can manually change the bias in the final softmax layer to prevent certain words from appearing in the translation. Anything else would be pretty difficult to do.
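As a rough illustration of the bias trick, here is a minimal pure-Python sketch. The four-word vocabulary, the logit values, and the helper name `suppress_tokens` are all made up for the example. In a real model you would edit the bias vector of the output projection layer instead.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def suppress_tokens(bias, banned_ids, penalty=-1e9):
    """Return a copy of the output bias with banned token ids set to a very negative value.

    Hypothetical helper: a real model stores this bias as a parameter of its
    final projection layer.
    """
    new_bias = list(bias)
    for i in banned_ids:
        new_bias[i] = penalty
    return new_bias

# Toy vocabulary and pre-bias scores (assumed values, not from a real model).
vocab = ["server", "waiter", "host", "</s>"]
bias = [0.0, 0.0, 0.0, 0.0]
scores = [2.0, 2.5, 0.5, 0.1]

# Ban "waiter" for a data-center document.
banned = [vocab.index("waiter")]
new_bias = suppress_tokens(bias, banned)

probs = softmax([s + b for s, b in zip(scores, new_bias)])
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
```

Without the penalty the toy model would prefer "waiter". With it, that token's probability collapses to zero and "server" wins. Note that this only blocks words; it cannot force a particular translation.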

What you want to do is called domain adaptation. To get an idea of how domain adaptation is usually done, you can have a look at a survey paper.

The most commonly used approaches are probably model fine-tuning and ensembling with a language model. If you have parallel data in your domain, you can try to fine-tune your model on that parallel data (with simple SGD and a small learning rate).
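To make the fine-tuning idea concrete, here is a deliberately tiny sketch. It is not a translation model: a single "pretrained" weight stands in for the whole network, and the in-domain pairs are invented. It only shows the mechanics of continuing training from existing weights with plain SGD and a small learning rate.

```python
def sgd_finetune(w, data, lr=0.01, epochs=5):
    """Continue training a one-parameter model y = w * x with plain SGD.

    Toy stand-in for fine-tuning: `w` plays the role of pretrained weights,
    `data` the role of in-domain parallel data.
    """
    for _ in range(epochs):
        for x, y in data:
            grad = (w * x - y) * x  # gradient of 0.5 * (w*x - y)**2 w.r.t. w
            w -= lr * grad
    return w

def mean_squared_error(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

pretrained_w = 1.0                                   # assumed general-domain weight
in_domain = [(1.0, 1.5), (2.0, 3.0), (3.0, 4.5)]     # assumed in-domain pairs (y = 1.5 * x)

loss_before = mean_squared_error(pretrained_w, in_domain)
finetuned_w = sgd_finetune(pretrained_w, in_domain)
loss_after = mean_squared_error(finetuned_w, in_domain)
```

The small learning rate moves the weight toward the in-domain optimum without jumping far from the pretrained value, which is the usual reason for choosing it when adapting a general-domain model.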

If you only have monolingual data in the target language, you can train a language model on that data. During decoding, you can mix the probabilities from the domain-specific language model and the translation model. Unfortunately, I don't know of any tool that could do this out of the box.
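One common way to mix the two models is a weighted sum of log-probabilities at each decoding step, often called shallow fusion. The sketch below assumes that scheme; the candidate words, the probabilities, and the weight `lm_weight` are invented for illustration.

```python
import math

def shallow_fusion(tm_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token scores as log p_TM + lm_weight * log p_LM.

    The result is an unnormalized score, which is enough for picking the
    best token or for ranking hypotheses in beam search.
    """
    return [t + lm_weight * l for t, l in zip(tm_log_probs, lm_log_probs)]

# Two German candidates for "server" (assumed toy distributions).
vocab = ["Server", "Kellner"]
tm = [math.log(0.45), math.log(0.55)]  # translation model slightly prefers "Kellner"
lm = [math.log(0.90), math.log(0.10)]  # data-center language model prefers "Server"

fused = shallow_fusion(tm, lm, lm_weight=0.5)
tm_choice = vocab[tm.index(max(tm))]
fused_choice = vocab[fused.index(max(fused))]
```

Here the translation model alone would pick "Kellner", while the fused score picks "Server". The weight has to be tuned on held-out in-domain data: too high and the output drifts away from the source sentence.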
