Big loss and low accuracy on training data in both BERT and ALBERT

Question

I am using the Hugging Face TFBertModel to do a classification task (from here: ). I am using the bare TFBertModel with an added dense head layer rather than TFBertForSequenceClassification, since I didn't see how I could use the latter with pretrained weights to only fine-tune the model.

As far as I know, fine-tuning should give me around 80% accuracy or more with both BERT and ALBERT, but I am not coming anywhere near that number:

Train on 3600 samples, validate on 400 samples
Epoch 1/2
3600/3600 [==============================] - 177s 49ms/sample - loss: 0.6531 - accuracy: 0.5792 - val_loss: 0.5296 - val_accuracy: 0.7675
Epoch 2/2
3600/3600 [==============================] - 172s 48ms/sample - loss: 0.6288 - accuracy: 0.6119 - val_loss: 0.5020 - val_accuracy: 0.7850

More epochs don't make much difference.

I am using the public CoLA dataset for fine-tuning; this is what the data looks like:

gj04    1       Our friends won't buy this analysis, let alone the next one we propose.
gj04    1       One more pseudo generalization and I'm giving up.
gj04    1       One more pseudo generalization or I'm giving up.
gj04    1       The more we study verbs, the crazier they get.
...

And this is the code that loads the data into Python:

import csv


def get_cola_data(max_items=None):
    csv_file = open('cola_public/raw/in_domain_train.tsv')

    reader = csv.reader(csv_file, delimiter='\t')
    x = []
    y = []

    for row in reader:
        x.append(row[3])
        y.append(float(row[1]))

    if max_items is not None:
        x = x[:max_items]
        y = y[:max_items]

    return x, y

I verified that the data ends up in the lists in the format I want, and this is the code of the model itself:

#!/usr/bin/env python

import tensorflow as tf
from tensorflow import keras
from transformers import BertTokenizer, TFBertModel
import numpy as np
from cola_public import get_cola_data


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

bert_model.trainable = False

x_input = keras.Input(shape=(512,), dtype=tf.int64)
x_mask = keras.Input(shape=(512,), dtype=tf.int64)

_, output = bert_model([x_input, x_mask])
output = keras.layers.Dense(1)(output)

model = keras.Model(
    inputs=[x_input, x_mask],
    outputs=output,
    name='bert_classifier',
)

model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy'],
)

train_data_x, train_data_y = get_cola_data(max_items=4000)

encoded_data = [tokenizer.encode_plus(data, add_special_tokens=True, pad_to_max_length=True) for data in train_data_x]

train_data_x = np.array([data['input_ids'] for data in encoded_data])
mask_data_x = np.array([data['attention_mask'] for data in encoded_data])

train_data_y = np.array(train_data_y)

model.fit(
    [train_data_x, mask_data_x],
    train_data_y,
    epochs=2,
    validation_split=0.1,
)

cmd_input = ''

while True:
    print("Type an opinion: ")
    cmd_input = input()
    # print('Your opinion is: %s' % cmd_input)

    if cmd_input == 'exit':
        break

    cmd_input_tokens = tokenizer.encode_plus(cmd_input, add_special_tokens=True, pad_to_max_length=True)
    cmd_input_ids = np.array([cmd_input_tokens['input_ids']])
    cmd_mask = np.array([cmd_input_tokens['attention_mask']])

    model.reset_states()
    result = model.predict([cmd_input_ids, cmd_mask])

    print(result)

Now, no matter whether I use a different dataset, a different number of items from the dataset, a dropout layer before the last dense layer, an additional dense layer with more units before the last one, or ALBERT instead of BERT, I always get low accuracy and high loss, and often the validation accuracy is higher than the training accuracy.

I get the same results if I try to use BERT/ALBERT for an NER task: always the same outcome, which makes me believe I am systematically making some fundamental mistake in fine-tuning.

I know that I have bert_model.trainable = False, and that is intentional, since I want to train only the head on top and not the pretrained weights, and I know that people train that way successfully. Even if I do train the pretrained weights, the results are much worse.

I can see that I am severely underfitting, but I just can't put my finger on what I could improve here, especially since people tend to have good results with just a single dense layer on top of the model.

Answer

The default learning rate is too high for BERT. Try setting it to one of the learning rates recommended in Appendix A.3 of the original paper: 5e-5, 3e-5, or 2e-5.
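
For example, a minimal sketch reusing the model defined in the question: only the optimizer argument changes, since Keras' Adam otherwise defaults to a learning rate of 1e-3, far above the range recommended for BERT.

model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    # Adam defaults to learning_rate=1e-3; use one of the BERT-recommended
    # values (5e-5, 3e-5 or 2e-5) instead.
    optimizer=keras.optimizers.Adam(learning_rate=3e-5),
    metrics=['accuracy'],
)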
