Spacy's BERT model doesn't learn

Problem Description

I've been trying to use spaCy's pretrained BERT model de_trf_bertbasecased_lg to increase the accuracy of my classification project. I previously built a model from scratch using de_core_news_sm and everything worked fine: I had an accuracy of around 70%. But now I am using the pretrained BERT model instead and I'm getting 0% accuracy. I don't believe it's really working that badly, so I'm assuming there is just a problem with my code. I might have missed something important, but I can't figure out what. I used the code in this article as an example.

Here is my code:

import spacy
from spacy.util import minibatch
from random import shuffle

spacy.require_gpu()
nlp = spacy.load('de_trf_bertbasecased_lg')

data = get_data()  # get_data() function returns a list with train data (I'll explain later how it looks)

textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": False})

for category in categories:  # categories - a list of 21 different categories used for classification
    textcat.add_label(category)
nlp.add_pipe(textcat)

num = 0  # number used for counting batches
optimizer = nlp.resume_training()
for i in range(2):
    shuffle(data)
    losses = {}
    for batch in minibatch(data):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
        num += 1

        if num % 10000 == 0:  # test model's performance every 10000 batches
            acc = test(nlp)  # function test() will be explained later
            print(f'Accuracy: {acc}')

nlp.to_disk('model/')

The get_data() function opens the files for the different categories, creates a tuple like (text, {'cats': {'category1': 0, 'category2': 1, ...}}) for each example, gathers all these tuples into one list, and returns it to the main function.
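
For illustration, a minimal sketch of what such a function might look like (the file layout, paths, and category names here are hypothetical):

import os

# Hypothetical sketch: assumes one text file per category with one
# training example per line; the real file layout will differ.
def get_data(data_dir="data", categories=()):
    data = []
    for category in categories:
        path = os.path.join(data_dir, category + ".txt")
        with open(path, encoding="utf-8") as f:
            for line in f:
                text = line.strip()
                if not text:
                    continue
                # 1.0 for the file's own category, 0.0 for all the others
                cats = {c: float(c == category) for c in categories}
                data.append((text, {"cats": cats}))
    return data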

The test(nlp) function opens the file with the test data, predicts a category for each line in the file, and checks whether the prediction was correct.
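
Again for illustration, a minimal sketch of such a function (the tab-separated file format assumed here is hypothetical):

# Hypothetical sketch: assumes tab-separated "text<TAB>category" lines;
# the real file format will differ.
def test(nlp, path="data/test.txt"):
    correct = 0
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            text, gold = line.rstrip("\n").split("\t")
            doc = nlp(text)
            # take the highest-scoring category as the prediction
            predicted = max(doc.cats, key=doc.cats.get)
            correct += int(predicted == gold)
            total += 1
    return correct / total if total else 0.0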

Again, everything worked just fine with de_core_news_sm, so I'm pretty sure that the get_data() and test(nlp) functions are working fine. The code above looks just like the example, but I still get 0% accuracy. I don't understand what I'm doing wrong.

Thanks in advance for your help!

UPDATE

Trying to understand the above problem, I decided to try the model with only a few examples (just as advised here). Here is the code:

import spacy
from spacy.util import minibatch
import random
import torch

train_data = [
    ("It is realy cool", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("I hate it", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}})
]

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.add_pipe(textcat)

optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)
print()

test_data = [
    "It is really cool",
    "I hate it",
    "Great!",
    "I do not think this is cool"
]

for line in test_data:
    print(line)
    print(nlp(line).cats)

The output is:

0 {'trf_textcat': 0.125}
1 {'trf_textcat': 0.12423406541347504}
2 {'trf_textcat': 0.12188033014535904}
3 {'trf_textcat': 0.12363225221633911}
4 {'trf_textcat': 0.11996611207723618}
5 {'trf_textcat': 0.14696261286735535}
6 {'trf_textcat': 0.12320466339588165}
7 {'trf_textcat': 0.12096124142408371}
8 {'trf_textcat': 0.15916231274604797}
9 {'trf_textcat': 0.1238454058766365}

It is really cool
{'POSITIVE': 0.47827497124671936, 'NEGATIVE': 0.5217249989509583}
I hate it
{'POSITIVE': 0.47827598452568054, 'NEGATIVE': 0.5217240452766418}
Great!
{'POSITIVE': 0.4782750606536865, 'NEGATIVE': 0.5217249393463135}
I do not think this is cool
{'POSITIVE': 0.478275328874588, 'NEGATIVE': 0.5217246413230896}

Not only does the model perform badly: the loss is not getting smaller, and the scores for all the test sentences are almost the same. Most importantly, it didn't even get the sentences right that appear in the training data. So my question is: does the model even learn? And what am I doing wrong?

Any ideas?

Answer

I received an answer to my question on GitHub, and it looks like some optimizer parameters must be specified, just like in this example.
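
For anyone hitting the same problem, the key change is to set the transformer-specific learning rate and weight decay on the optimizer returned by resume_training(). A minimal sketch, assuming spacy-transformers v0.x, where the optimizer exposes trf_lr and trf_weight_decay as in that example (the values below are illustrative, not tuned):

optimizer = nlp.resume_training()
# Attribute names as in the spacy-transformers v0.x textcat example;
# the values are illustrative, not tuned.
optimizer.alpha = 0.001             # learning rate for the non-transformer weights
optimizer.trf_weight_decay = 0.005  # weight decay for the transformer weights
optimizer.trf_lr = 2e-5             # learning rate for the transformer weights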
