How to predict unseen data from a trained multi-label text classification model?

Problem Description

First, I want to say I am completely new to machine learning and am still learning how these things work. I am working on categorizing reviews into multiple labels, and I built a multi-label text classifier by referring to this code.

The model is trained to categorize reviews into 9 labels, and it predicts the value for each label individually. Below is how I trained and tested the model so far. I haven't included the text-processing part; otherwise the code would be very long.
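Roughly, the omitted text-processing step cleans each review and converts it into token ids. A simplified sketch of what it does, assuming the FullTokenizer from the bert-for-tf2 package used in the referenced code (the import path can vary by version; the vocab path, the text_cleaner helper, and the train_reviews list are placeholders):

# sketch of the omitted preprocessing; vocab path and names are placeholders
from bert.tokenization.bert_tokenization import FullTokenizer

tokenizer = FullTokenizer(vocab_file="vocab.txt")  # hypothetical vocab file

def encode_review(text):
    # text_cleaner is my own cleaning helper (not shown)
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_cleaner(text)))

X_train = [encode_review(review) for review in train_reviews]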

This is the model, and it's the same one used here:

# BERT text model: embedding, three parallel Conv1D branches, and a dense head
import tensorflow as tf
from tensorflow.keras import layers

class TEXT_MODEL(tf.keras.Model):

    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)

        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")

        self.pool = layers.GlobalMaxPool1D()

        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            # binary case: a single sigmoid unit
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            # multi-class case: one softmax unit per class
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")

    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)

        concatenated = tf.concat([l_1, l_2, l_3], axis=-1)  # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training=training)
        model_output = self.last_dense(concatenated)

        return model_output
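Just to show how this subclassed model is used: the weights are only created on the first call, so a quick dummy call like the one below builds them (a sketch; the vocabulary size and sequence length here are arbitrary).

# smoke test with arbitrary numbers, just to build the weights
model = TEXT_MODEL(vocabulary_size=30522, model_output_classes=2)
dummy_batch = tf.random.uniform((4, 20), minval=0, maxval=30522, dtype=tf.int32)
probs = model(dummy_batch, training=False)
print(probs.shape)  # (4, 1) -- one sigmoid probability per review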

This is how I pass my tokenized reviews into the model, train it, and finally test it:

# X_train, y_train, X_test, the tokenizer, and the helper functions column()
# and progress() all come from the omitted text-processing code
import math

whole_real_predictions = []
whole_threshold_predictions = []

# train and predict for each label individually
for i in range(len(y_train[0])):
    print("\n" + str(i) + "'th label prediction started")
    count_zero = 0
    new_label = []
    new_tokenized_data_train = []
    label = column(y_train, i)  # column() extracts column i of y_train
    count_one = sum(label)
    print("count_one", count_one)

    # balance the training data: keep every positive sample and an equal
    # number of negative samples
    for k in range(len(label)):
        if count_zero < count_one and label[k] == 0:
            new_label.append(0)
            new_tokenized_data_train.append(X_train[k])
            count_zero = count_zero + 1
        if label[k] == 1:
            new_label.append(1)
            new_tokenized_data_train.append(X_train[k])

    print("count_zero", count_zero)
    print()

    # sort by review length so padded batches waste little padding
    data_with_len = [[value, new_label[j], len(value)]
                     for j, value in enumerate(new_tokenized_data_train)]
    data_with_len.sort(key=lambda x: x[2])
    sorted_data_labels = [(data_lab[0], data_lab[1]) for data_lab in data_with_len]
    processed_dataset = tf.data.Dataset.from_generator(
        lambda: sorted_data_labels, output_types=(tf.int32, tf.int32))

    BATCH_SIZE = 32
    batched_dataset = processed_dataset.padded_batch(
        BATCH_SIZE, padded_shapes=((None,), ()))
    TOTAL_BATCHES = math.ceil(len(sorted_data_labels) / BATCH_SIZE)
    TEST_BATCHES = TOTAL_BATCHES // TOTAL_BATCHES  # always 1; probably meant TOTAL_BATCHES // 10
    # shuffle() returns a new dataset, so the result must be assigned back
    batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES)
    test_data = batched_dataset.take(TEST_BATCHES)
    train_data = batched_dataset.skip(TEST_BATCHES)

    VOCAB_LENGTH = len(tokenizer.vocab)
    EMB_DIM = 260
    CNN_FILTERS = 50
    DNN_UNITS = 256
    OUTPUT_CLASSES = 2
    DROPOUT_RATE = 0.2
    NB_EPOCHS = 6

    text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                            embedding_dimensions=EMB_DIM,
                            cnn_filters=CNN_FILTERS,
                            dnn_units=DNN_UNITS,
                            model_output_classes=OUTPUT_CLASSES,
                            dropout_rate=DROPOUT_RATE)

    if OUTPUT_CLASSES == 2:
        text_model.compile(loss="binary_crossentropy",
                           optimizer="adam",
                           metrics=["acc"])
    else:
        text_model.compile(loss="sparse_categorical_crossentropy",
                           optimizer="adam",
                           metrics=["sparse_categorical_accuracy"])

    text_model.fit(train_data, epochs=NB_EPOCHS)

    self_threshold_predictions = []
    self_label_real_values = []
    print("Predicting " + str(i) + "th label...")

    for e, item in enumerate(X_test):
        if e % 2 == 0:
            progress(e, len(X_test))  # progress() is a console progress helper
        res = text_model.predict([item])
        self_label_real_values.append(res[0][0])

        if res[0][0] > 0.93:
            self_threshold_predictions.append(res[0][0])
        else:
            self_threshold_predictions.append(0.0)

    whole_threshold_predictions.append(self_threshold_predictions)
    whole_real_predictions.append(self_label_real_values)

# transpose from per-label lists to per-review rows
whole_threshold_predictions = list(map(list, zip(*whole_threshold_predictions)))
whole_real_predictions = list(map(list, zip(*whole_real_predictions)))
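The two zip(*...) lines at the end just transpose the nested lists, turning nine per-label prediction lists into one row of nine values per review. A tiny illustration:

# transpose: 2 labels x 3 reviews -> 3 reviews x 2 labels
per_label = [[0.1, 0.2, 0.3],
             [0.9, 0.8, 0.7]]
per_review = list(map(list, zip(*per_label)))
print(per_review)  # [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]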

Now I need to use this text_model to predict unseen data. So I studied similar cases, and they mention that I need to save and load the model, and that the data has to be passed in the same way as during training.
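As far as I understand, for a subclassed Keras model like this one, saving means saving only the weights and rebuilding the architecture before loading. A sketch of what I think that looks like (the file name is made up; the hyperparameters are the ones from the training loop):

# hypothetical example: persist the trained model's weights and restore them
text_model.save_weights("label_0_weights")  # made-up path

restored = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                      embedding_dimensions=EMB_DIM,
                      cnn_filters=CNN_FILTERS,
                      dnn_units=DNN_UNITS,
                      model_output_classes=OUTPUT_CLASSES,
                      dropout_rate=DROPOUT_RATE)
restored(tf.constant([[1, 2, 3, 4, 5]]), training=False)  # one call creates the variables
restored.load_weights("label_0_weights")

For now, though, I tried to keep using the same in-memory model, text_model, and the code is as follows.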

import pandas as pd

# text_cleaner, tokenizer, progress, and text_model come from the code above
user = pd.read_csv("Noodlesam.csv")
user = user.dropna()
user['Trimmed text'] = user['Trimmed text'].astype(str).apply(text_cleaner)
input_texts = user['Trimmed text'].tolist()  # renamed from `input` to avoid shadowing the builtin

# the original code reused this function's name for the result list below,
# which overwrote the function; renamed to avoid that
def tokenize_input(data):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(data))

tokenized_input = [tokenize_input(data) for data in input_texts]
print(len(tokenized_input))

new_whole_threshold_predictions = []
new_whole_real_predictions = []
new_self_threshold_predictions = []
new_self_label_real_values = []

for q, item1 in enumerate(tokenized_input):
    if q % 2 == 0:
        progress(q, len(tokenized_input))
    res = text_model.predict([item1])
    new_self_label_real_values.append(res[0][0])

    if res[0][0] > 0.93:
        new_self_threshold_predictions.append(1.0)
    else:
        new_self_threshold_predictions.append(0.0)

new_whole_threshold_predictions.append(new_self_threshold_predictions)
new_whole_real_predictions.append(new_self_label_real_values)

new_whole_threshold_predictions = list(map(list, zip(*new_whole_threshold_predictions)))
new_whole_real_predictions = list(map(list, zip(*new_whole_real_predictions)))

But this gives me output for only one of the 9 labels. I do understand that in the previous code, the training data was fitted to the model inside the for i in range(len(y_train[0])): loop, i.e. 9 times, once per label. From this point on, I don't understand how this model works, and I need to know how I can use it for unseen data.
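If my understanding is right, after the loop text_model only holds the weights of the last (9th) label's model, so predicting unseen data with it can only ever give that one label's score. What I imagine is needed is to keep each label's trained model and loop over all of them, roughly like this (a sketch, assuming the training loop were changed to append every text_model to a per_label_models list):

# sketch: per_label_models would be filled inside the training loop with
# per_label_models.append(text_model), one entry per label
unseen = tokenized_input[0]
scores = [m.predict([unseen])[0][0] for m in per_label_models]
print(scores)  # nine sigmoid probabilities, one per label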

I think this explains my issue; any help would be really appreciated.

Answer

If I understand it correctly, you train the model with OUTPUT_CLASSES = 2, which makes it use a sigmoid output layer through this condition:

if model_output_classes == 2:
    self.last_dense = layers.Dense(units=1,
                                   activation="sigmoid")
else:
    self.last_dense = layers.Dense(units=model_output_classes,
                                   activation="softmax")

This means you get only a single output unit with a probability between 0 and 1. You can fix this by changing that variable to the number of labels you have, so that the softmax output layer is used instead. Softmax will give you a probability distribution over the labels.
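A sketch of the suggested change, reusing the hyperparameters from the question:

NUM_LABELS = 9

text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=NUM_LABELS,  # > 2, so the softmax branch is used
                        dropout_rate=DROPOUT_RATE)

text_model.compile(loss="sparse_categorical_crossentropy",
                   optimizer="adam",
                   metrics=["sparse_categorical_accuracy"])

res = text_model.predict([tokenized_input[0]])
print(res[0])  # nine softmax probabilities that sum to 1

Note that with a softmax over 9 units, each review is assumed to belong to exactly one label.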
