The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1


Problem description

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and I am now in the testing phase. I know that BERT can only take 512 tokens, so I wrote an if condition to check the length of each test sentence in my dataframe. If it is longer than 512, I split the sentence into sequences of 512 tokens each, and then run the tokenizer's encode. The length of each sequence is 512; however, after tokenizer encoding the length becomes 707 and I get this error:

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

Here is the code I used for the previous steps:

import math

import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

pred = []
if len(test_sentence_in_df.split()) > 512:
    # Split the long sentence into chunks of at most 512 whitespace-separated words
    n = math.ceil(len(test_sentence_in_df.split()) / 512)
    for i in range(n):
        if i == (n - 1):
            print(i)
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:])
        else:
            print("i in else", str(i))
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:(i + 1) * 512])
            # print(len(test_sentence.split()))  # here the length is 512
        tokenized_sentence = tokenizer.encode(test_sentence)
        input_ids = torch.tensor([tokenized_sentence]).cuda()
        print(len(tokenized_sentence))  # here the length is 707
        with torch.no_grad():
            output = model(input_ids)
            label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
        pred.append(label_indices)

print(pred)
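
For reference, a minimal check (reusing the tokenizer and test_sentence from the snippet above; the counts are the ones reported in the question, not re-verified) to see the mismatch is to compare the whitespace word count with the encoded token count:

# Compare raw word count with the number of token ids produced by the tokenizer.
words = test_sentence.split()
token_ids = tokenizer.encode(test_sentence)
print(len(words), len(token_ids))  # the question reports 512 words but 707 token ids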

Recommended answer

This is because BERT uses WordPiece tokenization. When a word is not in the vocabulary, it is split into word pieces; for example, if the word playing is not in the vocabulary, it can be split into play and ##ing. This increases the number of tokens in a given sentence after tokenization. You can pass certain parameters to get fixed-length tokenization:

tokenized_sentence = tokenizer.encode(test_sentence, padding=True, truncation=True, max_length=50, add_special_tokens=True)
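
As a sketch only (this is not part of the original answer): for the 512-token limit in the question, max_length would typically be set to the model's limit rather than 50, and return_tensors='pt' avoids the manual torch.tensor wrapping. Something along these lines could replace the encode call inside the loop above:

# Hedged sketch: truncate each chunk to BERT's 512-token limit, which includes
# the [CLS]/[SEP] special tokens the tokenizer adds.
input_ids = tokenizer.encode(
    test_sentence,
    add_special_tokens=True,
    truncation=True,
    max_length=512,
    return_tensors='pt'   # returns a tensor of shape (1, seq_len)
).cuda()

with torch.no_grad():
    output = model(input_ids)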
