How to use Hugging Face Transformers library in Tensorflow for text classification on custom data?


Problem Description

I am trying to do binary text classification on custom data (which is in CSV format) using the different transformer architectures that the Hugging Face 'Transformers' library offers. I am using this Tensorflow blog post as a reference.

I am loading the custom dataset into 'tf.data.Dataset' format using the following code:

import tensorflow as tf

def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,  # Artificially small to make examples easier to show.
        na_value="",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
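
For context, a hypothetical call for a CSV file with a 'text' column and a 'label' column would look like the following (the file name and column names here are placeholders, not my actual data):

train_data = get_dataset("train.csv",
                         label_name="label",
                         select_columns=["text", "label"])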

After this, when I tried to tokenize using the 'glue_convert_examples_to_features' method as below:

train_dataset = glue_convert_examples_to_features(
    examples=train_data,
    tokenizer=tokenizer,
    task=None,
    label_list=['0', '1'],
    max_length=128
)

it throws the error "UnboundLocalError: local variable 'processor' referenced before assignment" at:

if is_tf_dataset:
    example = processor.get_example_from_tensor_dict(example)
    example = processor.tfds_map(example)

In all the examples I have seen, they use pre-defined tasks like 'mrpc' that have a GLUE processor to handle them. The error is raised at line 85 in the source code.
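
Looking at that source, the cause seems to be that processor is only assigned when a known GLUE task name is passed, so with task=None the is_tf_dataset branch is reached with processor still unbound. A simplified sketch of the control flow (not the verbatim source):

# Simplified sketch of glue_convert_examples_to_features' control flow
if task is not None:
    processor = glue_processors[task]()  # only assigned for a pre-defined GLUE task
# ... later, for tf.data.Dataset inputs:
if is_tf_dataset:
    example = processor.get_example_from_tensor_dict(example)  # UnboundLocalError when task is None

So the function apparently cannot be used as-is with custom data and task=None.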

Can anyone help with solving this issue for 'custom data'?

Answer

I had the same problem when starting out.

This Kaggle submission helped me a lot. There you can see how to tokenize the data according to the chosen pre-trained model:

from tqdm import tqdm
from transformers import BertTokenizer

def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    tokenized_sentences = []

    for sentence in tqdm(sentences):
        tokenized_sentence = tokenizer.encode(
            sentence,                  # Sentence to encode.
            add_special_tokens=True,   # Add '[CLS]' and '[SEP]'.
            max_length=max_seq_len,    # Truncate all sentences.
        )
        tokenized_sentences.append(tokenized_sentence)

    return tokenized_sentences

tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True)
train_ids = tokenize_sentences(your_sentence_list, tokenizer)
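
The create_dataset function below also needs attention masks. A minimal sketch of how they can be built from the padded ids (the pad_and_mask helper is my own addition, assuming zero-padding up to max_seq_len):

import tensorflow as tf

def pad_and_mask(tokenized_sentences, max_seq_len=128):
    # Pad/truncate every sentence to max_seq_len with trailing zeros.
    input_ids = tf.keras.preprocessing.sequence.pad_sequences(
        tokenized_sentences, maxlen=max_seq_len,
        padding="post", truncating="post", value=0)
    # Attention mask: 1 for real tokens, 0 for padding (BERT's [PAD] id is 0).
    attention_masks = (input_ids != 0).astype("int32")
    return input_ids, attention_masks

train_ids, train_masks = pad_and_mask(train_ids)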

Furthermore, I looked into the source of glue_convert_examples_to_features. There you can see how a tf.data.Dataset compatible with the BERT model can be created. I wrote a function for this:

def create_dataset(ids, masks, labels):
    def gen():
        # Yield one (features, label) pair at a time in the input format
        # expected by TFBertForSequenceClassification.
        for i in range(len(ids)):  # iterate over the argument, not the global train_ids
            yield (
                {
                    "input_ids": ids[i],
                    "attention_mask": masks[i]
                },
                labels[i],
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None])
            },
            tf.TensorShape([None]),
        ),
    )

train_dataset = create_dataset(train_ids, train_masks, train_labels)
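
Note that the generator yields one example at a time, so the dataset should be shuffled and batched before training (the buffer and batch sizes below are arbitrary example values):

train_dataset = train_dataset.shuffle(buffer_size=1000).batch(32)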

然后我使用这样的数据集:

I then use the dataset like this:

from transformers import TFBertForSequenceClassification, BertConfig

model = TFBertForSequenceClassification.from_pretrained(
    bert_model_name, 
    config=BertConfig.from_pretrained(bert_model_name, num_labels=20)
)

# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.CategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=1, steps_per_epoch=115, validation_data=val_dataset, validation_steps=7)
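
One caveat: CategoricalCrossentropy expects one-hot label vectors, which is consistent with the TensorShape([None]) label shape declared in create_dataset. If your labels are integer class ids, convert them before building the dataset; a sketch (num_classes=20 matches this model, use 2 for the binary case):

# Convert integer class ids to one-hot vectors; cast to match the tf.int64 label dtype.
train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=20).astype("int64")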
