An NLP Model that Suggests a List of Words for an Incomplete Sentence


Problem Description

I have read a number of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests a word for an incomplete sentence.

  Example:

  Incomplete Sentence :
  I bought an ___________  because its rainy.

  Suggested Words:
      umbrella
      soup
      jacket

In the papers I have read, the authors utilized the Microsoft Sentence Completion Dataset to predict a missing word in a sentence.

  Example :

  Incomplete Sentence :

  Im sad because you are __________

  Missing Word Options:
  a) crying
  b) happy
  c) pretty
  d) sad
  e) bad

I don't want to predict a missing word from a list of options. I want to suggest a list of words for an incomplete sentence. Is that feasible? Please enlighten me, because I am really confused. What state-of-the-art model can I use to suggest a list of (semantically coherent) words for an incomplete sentence?

Is it necessary for the list of suggested words, as the output, to be included in the training dataset?

Answer

This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict those words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, the suggested words should be part of the overall vocabulary with which the BERT model was trained.

I adapted this answer to show how such a completion function may work.

# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()  # turn off dropout

def fill_the_gaps(text):
    # add the special tokens BERT expects around the input
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    # a single segment (sentence A) for the whole input
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    # for every [MASK] position, pick the most likely token
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results

print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))

The [MASK] symbol indicates a missing word (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs of these particular prints are:

['umbrella']
['here']
['worried']
['here', 'here']

The duplication is not surprising: transformer networks are generally good at copying words, and from a semantic point of view these symmetric continuations do indeed look very likely.
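
Note that fill_the_gaps returns only the single most likely token for each gap, whereas the goal here is a list of suggestions. Below is a minimal sketch of a top-k variant, reusing the model and tokenizer objects from the snippet above (the name suggest_words and the parameter k are my own, introduced for illustration):

def suggest_words(text, k=5):
    # same preprocessing as fill_the_gaps above
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([[0] * len(tokenized_text)])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            # keep the k highest-scoring vocabulary ids instead of just the argmax
            top_ids = torch.topk(predictions[0, i], k)[1].tolist()
            results.append(tokenizer.convert_ids_to_tokens(top_ids))
    return results

print(suggest_words(text='I bought an [MASK] because its rainy .'))

This returns one ranked list of k candidate words per [MASK] token.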

Moreover, if it is not a random word that is missing but exactly the last word (or the last several words), you can use any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
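
As a rough illustration, here is a sketch of that approach using GPT-2 via the transformers package (the successor of pytorch-pretrained-bert); the function name suggest_next_words and the parameter k are my own, introduced for illustration:

# ! pip install transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_model.eval()  # turn off dropout

def suggest_next_words(prefix, k=5):
    # score the prefix and rank candidates for the next token
    input_ids = gpt2_tokenizer.encode(prefix, return_tensors='pt')
    with torch.no_grad():
        logits = gpt2_model(input_ids)[0]  # shape: [1, seq_len, vocab_size]
    next_token_logits = logits[0, -1]      # distribution over the next token
    top_ids = torch.topk(next_token_logits, k)[1].tolist()
    return [gpt2_tokenizer.decode([i]).strip() for i in top_ids]

print(suggest_next_words('I bought an umbrella because its'))

Unlike BERT, GPT-2 only sees the words to the left of the gap, so this works best when the missing words are at the end of the sentence.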
