Fine-Tuning DistilBertForSequenceClassification: Is not learning, why is loss not changing? Weights not updated?


Question


I am relatively new to PyTorch and Hugging Face transformers, and I experimented with DistilBertForSequenceClassification on this Kaggle dataset.

from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch.optim as optim
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

n_epochs = 5 # or whatever
batch_size = 32 # or whatever

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#bert_distil.classifier = nn.Sequential(nn.Linear(in_features=768, out_features=1), nn.Sigmoid())
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=0.1)

X_train = []
Y_train = []

for row in train_df.iterrows():
    seq = tokenizer.encode(preprocess_text(row[1]['text']),  add_special_tokens=True, pad_to_max_length=True)
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]).unsqueeze(0))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    j = 0
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices], Y_train[indices]
        batch_x.cuda()
        batch_y.cuda()
        outputs = bert_distil.forward(batch_x.cuda())
        loss = criterion(outputs[0],batch_y.squeeze().cuda())
        loss.requires_grad = True
   
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  
        j+=1
        if j == 20:   
            #print(outputs[0])
            print('[%d, %5d] running loss: %.3f loss: %.3f ' %
              (epoch + 1, i*1, running_loss / 20, loss.item()))
            running_loss = 0.0
            j = 0

[1,   608] running loss: 0.689 loss: 0.687
[1,  1248] running loss: 0.693 loss: 0.694
[1,  1888] running loss: 0.693 loss: 0.683
[1,  2528] running loss: 0.689 loss: 0.701
[1,  3168] running loss: 0.690 loss: 0.684
[1,  3808] running loss: 0.689 loss: 0.688
[1,  4448] running loss: 0.689 loss: 0.692
etc...

Regardless of what I tried, the loss never decreased (sometimes it even increased), and the predictions did not get better. It seems to me that I forgot something, so that the weights are not actually being updated. Does anyone have an idea?

What I tried

  • Different loss functions
    • BCE
    • CrossEntropy
    • even MSE-loss
  • One-hot encoding vs. a single neuron output (see the sketch after this list)
  • Different learning rates, and optimizers
  • I even changed all the targets to one single label, but even then the network didn't converge.
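
For reference, a minimal sketch of the two head/loss combinations I tried (hypothetical logits, not my actual model outputs): BCE expects one logit per example with float targets, while CrossEntropy expects one logit per class with integer class indices.

import torch
import torch.nn as nn

# single-neuron output head + BCE: float targets of shape (batch, 1)
logits_single = torch.randn(4, 1)
targets_float = torch.tensor([[0.], [1.], [1.], [0.]])
bce_loss = nn.BCEWithLogitsLoss()(logits_single, targets_float)

# two-class output head + CrossEntropy: class-index targets of shape (batch,)
logits_two = torch.randn(4, 2)   # DistilBertForSequenceClassification defaults to 2 labels
targets_idx = torch.tensor([0, 1, 1, 0])
ce_loss = nn.CrossEntropyLoss()(logits_two, targets_idx)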

Solution

Looking at the running loss and the per-minibatch loss is easily misleading. You should look at the epoch loss, because the inputs to that loss are the same in every epoch.

Besides, there are some problems in your code. After fixing all of them the behavior is as expected: the loss slowly decreases after each epoch, and the model can also overfit a small minibatch. Please look at the code below; the changes include using model(x) instead of model.forward(x), calling cuda() only once, a smaller learning rate, etc.

Tuning and fine-tuning ML models is difficult work.

n_epochs = 5
batch_size = 1

bert_distil = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(bert_distil.parameters(), lr=1e-3)  # much smaller learning rate than 0.1

X_train = []
Y_train = []
for row in train_df.iterrows():
    seq = tokenizer.encode(row[1]['text'],  add_special_tokens=True, pad_to_max_length=True)[:100]
    X_train.append(torch.tensor(seq).unsqueeze(0))
    Y_train.append(torch.tensor([row[1]['target']]))
X_train = torch.cat(X_train)
Y_train = torch.cat(Y_train)

running_loss = 0.0
bert_distil.cuda()
bert_distil.train(True)
for epoch in range(n_epochs):
    permutation = torch.randperm(len(X_train))
    for i in range(0,len(X_train), batch_size):
        optimizer.zero_grad()
        indices = permutation[i:i+batch_size]
        batch_x, batch_y = X_train[indices].cuda(), Y_train[indices].cuda()  # move to GPU once and keep the returned tensors
        outputs = bert_distil(batch_x)  # call the model directly instead of .forward()
        loss = criterion(outputs[0], batch_y)
        loss.backward()
        optimizer.step()
   
        running_loss += loss.item()  

    print('[%d] epoch loss: %.3f' %
      (epoch + 1, running_loss / len(X_train) * batch_size))
    running_loss = 0.0

Output:

[1] epoch loss: 0.695
[2] epoch loss: 0.690
[3] epoch loss: 0.687
[4] epoch loss: 0.685
[5] epoch loss: 0.684
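
The epoch loss above still decreases only slowly, which is not surprising with lr=1e-3 on a pre-trained transformer. For comparison, below is a minimal sketch of the more conventional Hugging Face fine-tuning setup: AdamW with a learning rate around 2e-5, the get_linear_schedule_with_warmup that the question already imports, and letting the model compute the loss by passing labels=. The data names (train_df and its 'text'/'target' columns) and all hyperparameter values are assumptions carried over from the question, not tested results.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast,
                          get_linear_schedule_with_warmup)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2).to(device)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Tokenize the whole text column at once; padding/truncation replace pad_to_max_length.
enc = tokenizer(list(train_df['text']), padding=True, truncation=True,
                max_length=100, return_tensors='pt')
labels = torch.tensor(train_df['target'].values)
loader = DataLoader(TensorDataset(enc['input_ids'], enc['attention_mask'], labels),
                    batch_size=32, shuffle=True)

n_epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # typical fine-tuning learning rate
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=n_epochs * len(loader))

model.train()
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for input_ids, attention_mask, batch_y in loader:
        optimizer.zero_grad()
        out = model(input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=batch_y.to(device))        # model computes CrossEntropy internally
        out.loss.backward()
        optimizer.step()
        scheduler.step()
        epoch_loss += out.loss.item()
    print('[%d] epoch loss: %.3f' % (epoch + 1, epoch_loss / len(loader)))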
