Language translation using TorchText (PyTorch)


Problem description

I have recently started with ML/DL using PyTorch. The following PyTorch tutorial explains how to train a simple model for translating from German to English.

https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html

However, I am confused about how to use the model to run inference on custom input. My understanding so far is:

1) We need to save the "vocab" for both German (input) and English (output) [using torch.save()] so that they can be used later for running predictions.

2) When running inference on a German paragraph, we first need to convert the German text to a tensor using the German vocab file.

3) This tensor is passed to the model's forward method for translation.

4) The model returns a tensor for the destination language, i.e. English in the current example.

5) We use the English vocab saved in the first step to convert this tensor back to English text.
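
The five steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy vocabularies `de_stoi` and `en_itos`, the helpers `encode` and `decode`, and the placeholder `fake_model` are all hypothetical names that do not come from the tutorial. Plain Python lists stand in for tensors so that the index bookkeeping is easy to follow; a trained model would take the place of `fake_model`.

```python
# Toy vocabularies (hypothetical). In practice these would be the vocab
# objects saved after training and loaded again at inference time (step 1).
de_stoi = {"<unk>": 0, "<sos>": 1, "<eos>": 2, "guten": 3, "morgen": 4}
en_itos = {0: "<unk>", 1: "<sos>", 2: "<eos>", 3: "good", 4: "morning"}


def encode(text, stoi):
    """Step 2: whitespace-tokenize, map tokens to indices, wrap in <sos>/<eos>."""
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in text.lower().split()]
    return [stoi["<sos>"]] + ids + [stoi["<eos>"]]


def fake_model(src_ids):
    """Steps 3-4: placeholder for the trained model; returns fixed target indices."""
    return [1, 3, 4, 2]


def decode(ids, itos, specials=("<sos>", "<eos>")):
    """Step 5: map indices back to tokens and drop the special markers."""
    tokens = [itos.get(i, "<unk>") for i in ids]
    return " ".join(t for t in tokens if t not in specials)


src_ids = encode("Guten Morgen", de_stoi)
translation = decode(fake_model(src_ids), en_itos)
print(src_ids)      # [1, 3, 4, 2]
print(translation)  # good morning
```

The point of the sketch is that the source vocab is only needed for encoding and the target vocab only for decoding, so the two can be saved and loaded independently of the model weights.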

1) If the above understanding is correct, can these steps be treated as a generic approach for running inference with any language translation model, provided we know the source and destination languages and have the vocab files for both? Or can we use the vocab provided by third-party libraries such as spaCy?

2) How do we convert the output tensor returned by the model back to the target language? I couldn't find any example of how to do that. The tutorial above explains how to convert the input text to a tensor using the source-language vocab.

I could easily find various examples and detailed explanations for image/vision models, but not much for text.

Recommended answer

Yes, broadly what you describe is correct, and of course you can use any vocab, e.g. one provided by spaCy. To convert a tensor into natural text, one of the most common techniques is to keep both a dict that maps indices to words and another dict that maps words to indices. The code below does this:

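from collections import defaultdict

# tok2idx returns 0 for any unseen token, so index 0 is reserved for unknowns.
tok2idx = defaultdict(lambda: 0)
idx2tok = {0: "<unk>"}  # assumption: "<unk>" is an illustrative name for index 0
index = 1  # next free index (not initialised in the original snippet)

for seq in sequences:
    for tok in seq:
        if tok not in tok2idx:
            tok2idx[tok] = index
            idx2tok[index] = tok
            index += 1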

Here sequences is a list of all the sequences (i.e. the sentences in your dataset). If you only have a flat list of words or tokens, you can adapt the code easily by keeping only the inner loop.
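
As a minimal sketch of the reverse direction, the snippet below builds both mappings on a toy dataset and then turns a list of predicted indices back into text. The toy `sequences`, the `<sos>`/`<eos>`/`<unk>` marker names, the reserved index 0, and the helper `indices_to_text` are assumptions made for illustration. It also assumes the model's output holds one score per target-vocabulary entry at each position, so that an argmax over the last dimension yields the indices.

```python
from collections import defaultdict

# Toy dataset (hypothetical): each sequence is a list of tokens.
sequences = [["<sos>", "good", "morning", "<eos>"],
             ["<sos>", "good", "night", "<eos>"]]

tok2idx = defaultdict(lambda: 0)  # unseen tokens map to index 0
idx2tok = {0: "<unk>"}            # assumption: index 0 is the unknown token
index = 1
for seq in sequences:
    for tok in seq:
        if tok not in tok2idx:
            tok2idx[tok] = index
            idx2tok[index] = tok
            index += 1


def indices_to_text(indices, idx2tok, eos="<eos>", sos="<sos>"):
    """Convert predicted indices to a string, stopping at the first <eos>."""
    words = []
    for i in indices:
        tok = idx2tok.get(i, "<unk>")
        if tok == eos:
            break
        if tok != sos:
            words.append(tok)
    return " ".join(words)


# With a real PyTorch model the indices would come from something like:
#   predicted = output.argmax(dim=-1).tolist()
predicted = [1, 2, 5, 4, 4]
text = indices_to_text(predicted, idx2tok)
print(text)  # good night
```

Stopping at the first `<eos>` matters because a decoder that runs for a fixed number of steps keeps emitting tokens after the sentence has ended.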
