Estimate token probability/logits given a sentence without computing the entire sentence


Question

I have a sentence like: "I like sitting in my new chair and _____ about life".

And I have a SPECIFIC set of tokens like ["watch", "run", "think", "apple", "light"]

I would like to calculate the probability of each of those tokens appearing as the next word in that incomplete sentence. Hopefully I should find that the probability of "think" is higher than that of "apple", for instance.

I am working with pytorch-transformers (GPT2LMHeadModel specifically), and a possible solution is to evaluate the score of the full sentence with each of the tokens, but when the number of tokens to evaluate is on the order of 100 or 1,000 the computation time starts to be too long.

It must be possible to process the sentence only once and somehow use the hidden states to calculate the probabilities of the set of tokens, but I don't know how to do it.

Any ideas? Thanks in advance

The actual code looks like the one below (estimating the probability for the full sentence every time). For every sentence it takes about 0.1 seconds to run the score() method, which turns into hours if I want to evaluate a few thousand words.

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")


def score(sentence):
    # Score the sentence by its negative average cross-entropy loss over all tokens
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    loss = model(tensor_input, labels=tensor_input)
    return -loss[0].item()


candidates = ["watch", "run", "think", "apple", "light"]
sent_template = "I like sitting in my new chair and {} about life"
print({candidate: score(sent_template.format(candidate)) for candidate in candidates})
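
As a point of reference for the "process the sentence only once" idea above: if it is enough to score each candidate as the immediate next token after the shared prefix (rather than the average loss of the full sentence including " about life"), a single forward pass over the prefix suffices. The following is only a minimal sketch under that assumption; it also assumes that each candidate word, with a leading space, maps to a single token in GPT-2's vocabulary.

import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prefix = "I like sitting in my new chair and"
candidates = ["watch", "run", "think", "apple", "light"]

with torch.no_grad():
    # One forward pass over the shared prefix
    prefix_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenizer.tokenize(prefix))])
    logits = model(prefix_ids)[0]  # shape: (1, prefix_length, vocab_size)
    # Log-probabilities over the vocabulary for the position right after the prefix
    next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)

scores = {}
for word in candidates:
    # The leading space matters for GPT-2's BPE; assumes the candidate is a single token
    token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + word))
    scores[word] = next_token_logprobs[token_ids[0]].item()

print(scores)

Note that these numbers measure only the probability of the candidate given the prefix, so they are not directly comparable to the score() values above, which average the loss over every token of the completed sentence.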

Answer

Your example produced the following output and took around 48.5 seconds to finish with 282 candidates in my environment (I only did 3 runs):

{'watch': -5.406847953796387, 'run': -5.533411502838135, 'think': -4.525279521942139, 'apple': -6.158637046813965, 'light': -5.835141658782959}

As mentioned in the comments, I think you can spare some computation with the past parameter and the fast tokenizer, as shown in the commented example below:

import torch

from transformers import GPT2TokenizerFast, GPT2LMHeadModel
from torch.nn import CrossEntropyLoss

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

###We calculate the hidden_states and the past of the common left part of the sentence
past = "I like sitting in my new chair and"
past_tokenize_input = tokenizer.tokenize(past)
past_tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(past_tokenize_input)])

past_last_hidden_state, past = model.transformer(past_tensor_input)

def score(sentence, past, past_last_hidden_state, past_tensor_input):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

    ###the following code is slightly modified from https://github.com/huggingface/transformers/blob/09a2f40684f77e62d0fd8485fe9d2d610390453f/src/transformers/modeling_gpt2.py#L604
    ###now we calculate the right part of the sentence with the already calculated past
    transformer_outputs = model.transformer(
            tensor_input,
            past=past,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            use_cache=None,
            output_attentions=None,
            output_hidden_states=None,
        )
    ###and concatenate the output of with the hidden_state of the left part of the sentence
    hidden_states = torch.cat((past_last_hidden_state, transformer_outputs[0]), dim=1)
    
    ###the following part is exactly the same as https://github.com/huggingface/transformers/blob/09a2f40684f77e62d0fd8485fe9d2d610390453f/src/transformers/modeling_gpt2.py#L604
    lm_logits = model.lm_head(hidden_states)

    labels_input = torch.cat((past_tensor_input, tensor_input), dim=1)

    # Shift so that tokens < n predict n
    shift_logits = lm_logits[..., :-1, :].contiguous()
    shift_labels = labels_input[..., 1:].contiguous()
    # Flatten the tokens
    loss_fct = CrossEntropyLoss()
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    return -loss.item()

candidates = ["watch", "run", "think", "apple", "light"]

sent_template = " {} about life"

print({candidate: score(sent_template.format(candidate), past, past_last_hidden_state, past_tensor_input) for candidate in candidates})

Output:

{'watch': -5.406846046447754, 'run': -5.533413887023926, 'think': -4.525280952453613, 'apple': -6.158637046813965, 'light': -5.835141181945801}

The runtime here was 40.5 seconds with 282 candidates (3 cycles again). You can also see that I lost some precision, presumably because splitting the forward pass into two calls slightly changes the floating-point accumulation.

Many thanks to patrickvonplaten, who gave me a good explanation about the past implementation.
