How to predict unseen data from a trained multi-label text classification model?

Problem Description

First, I want to say I am completely new to machine learning and am still learning how these things work. I am working on categorizing reviews into multiple labels, and I built a multi-label text classifier by referring to this code.

The model is trained to categorize reviews into 9 labels, and it predicts the value for each label individually. Below is how I trained and tested the model so far. I haven't included the text-processing part; otherwise the code would be very long.
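Roughly, the omitted text-processing step cleans each review and converts it into token ids. A simplified sketch of what it does, assuming the FullTokenizer from the bert-for-tf2 package used in the referenced code (the import path can vary by version; the vocab path, the text_cleaner helper, and the train_reviews list are placeholders):

# sketch of the omitted preprocessing; vocab path and names are placeholders
from bert.tokenization.bert_tokenization import FullTokenizer

tokenizer = FullTokenizer(vocab_file="vocab.txt")  # hypothetical vocab file

def encode_review(text):
    # text_cleaner is my own cleaning helper (not shown)
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_cleaner(text)))

X_train = [encode_review(review) for review in train_reviews]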

This is the model, and it's the same one used here:

# BERT text model: embedding, three parallel Conv1D branches, and a dense head
import tensorflow as tf
from tensorflow.keras import layers

class TEXT_MODEL(tf.keras.Model):

    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)

        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")

        self.pool = layers.GlobalMaxPool1D()

        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            # binary case: a single sigmoid unit
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            # multi-class case: one softmax unit per class
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")

    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l)
        l_1 = self.pool(l_1)
        l_2 = self.cnn_layer2(l)
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3)

        concatenated = tf.concat([l_1, l_2, l_3], axis=-1)  # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training=training)
        model_output = self.last_dense(concatenated)

        return model_output
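Just to show how this subclassed model is used: the weights are only created on the first call, so a quick dummy call like the one below builds them (a sketch; the vocabulary size and sequence length here are arbitrary).

# smoke test with arbitrary numbers, just to build the weights
model = TEXT_MODEL(vocabulary_size=30522, model_output_classes=2)
dummy_batch = tf.random.uniform((4, 20), minval=0, maxval=30522, dtype=tf.int32)
probs = model(dummy_batch, training=False)
print(probs.shape)  # (4, 1) -- one sigmoid probability per review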

This is how I pass my tokenized reviews into the model, train it, and finally test it:

# X_train, y_train, X_test, the tokenizer, and the helper functions column()
# and progress() all come from the omitted text-processing code
import math

whole_real_predictions = []
whole_threshold_predictions = []

# train and predict for each label individually
for i in range(len(y_train[0])):
    print("\n" + str(i) + "'th label prediction started")
    count_zero = 0
    new_label = []
    new_tokenized_data_train = []
    label = column(y_train, i)  # column() extracts column i of y_train
    count_one = sum(label)
    print("count_one", count_one)

    # balance the training data: keep every positive sample and an equal
    # number of negative samples
    for k in range(len(label)):
        if count_zero < count_one and label[k] == 0:
            new_label.append(0)
            new_tokenized_data_train.append(X_train[k])
            count_zero = count_zero + 1
        if label[k] == 1:
            new_label.append(1)
            new_tokenized_data_train.append(X_train[k])

    print("count_zero", count_zero)
    print()

    # sort by review length so padded batches waste little padding
    data_with_len = [[value, new_label[j], len(value)]
                     for j, value in enumerate(new_tokenized_data_train)]
    data_with_len.sort(key=lambda x: x[2])
    sorted_data_labels = [(data_lab[0], data_lab[1]) for data_lab in data_with_len]
    processed_dataset = tf.data.Dataset.from_generator(
        lambda: sorted_data_labels, output_types=(tf.int32, tf.int32))

    BATCH_SIZE = 32
    batched_dataset = processed_dataset.padded_batch(
        BATCH_SIZE, padded_shapes=((None,), ()))
    TOTAL_BATCHES = math.ceil(len(sorted_data_labels) / BATCH_SIZE)
    TEST_BATCHES = TOTAL_BATCHES // TOTAL_BATCHES  # always 1; probably meant TOTAL_BATCHES // 10
    # shuffle() returns a new dataset, so the result must be assigned back
    batched_dataset = batched_dataset.shuffle(TOTAL_BATCHES)
    test_data = batched_dataset.take(TEST_BATCHES)
    train_data = batched_dataset.skip(TEST_BATCHES)

    VOCAB_LENGTH = len(tokenizer.vocab)
    EMB_DIM = 260
    CNN_FILTERS = 50
    DNN_UNITS = 256
    OUTPUT_CLASSES = 2
    DROPOUT_RATE = 0.2
    NB_EPOCHS = 6

    text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                            embedding_dimensions=EMB_DIM,
                            cnn_filters=CNN_FILTERS,
                            dnn_units=DNN_UNITS,
                            model_output_classes=OUTPUT_CLASSES,
                            dropout_rate=DROPOUT_RATE)

    if OUTPUT_CLASSES == 2:
        text_model.compile(loss="binary_crossentropy",
                           optimizer="adam",
                           metrics=["acc"])
    else:
        text_model.compile(loss="sparse_categorical_crossentropy",
                           optimizer="adam",
                           metrics=["sparse_categorical_accuracy"])

    text_model.fit(train_data, epochs=NB_EPOCHS)

    self_threshold_predictions = []
    self_label_real_values = []
    print("Predicting " + str(i) + "th label...")

    for e, item in enumerate(X_test):
        if e % 2 == 0:
            progress(e, len(X_test))  # progress() is a console progress helper
        res = text_model.predict([item])
        self_label_real_values.append(res[0][0])

        if res[0][0] > 0.93:
            self_threshold_predictions.append(res[0][0])
        else:
            self_threshold_predictions.append(0.0)

    whole_threshold_predictions.append(self_threshold_predictions)
    whole_real_predictions.append(self_label_real_values)

# transpose from per-label lists to per-review rows
whole_threshold_predictions = list(map(list, zip(*whole_threshold_predictions)))
whole_real_predictions = list(map(list, zip(*whole_real_predictions)))
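The two zip(*...) lines at the end just transpose the nested lists, turning nine per-label prediction lists into one row of nine values per review. A tiny illustration:

# transpose: 2 labels x 3 reviews -> 3 reviews x 2 labels
per_label = [[0.1, 0.2, 0.3],
             [0.9, 0.8, 0.7]]
per_review = list(map(list, zip(*per_label)))
print(per_review)  # [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]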

Now I need to use this text_model to predict unseen data. So I studied similar cases, and they mention that I need to save and load the model, and that the data has to be passed in the same way as during training.
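As far as I understand, for a subclassed Keras model like this one, saving means saving only the weights and rebuilding the architecture before loading. A sketch of what I think that looks like (the file name is made up; the hyperparameters are the ones from the training loop):

# hypothetical example: persist the trained model's weights and restore them
text_model.save_weights("label_0_weights")  # made-up path

restored = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                      embedding_dimensions=EMB_DIM,
                      cnn_filters=CNN_FILTERS,
                      dnn_units=DNN_UNITS,
                      model_output_classes=OUTPUT_CLASSES,
                      dropout_rate=DROPOUT_RATE)
restored(tf.constant([[1, 2, 3, 4, 5]]), training=False)  # one call creates the variables
restored.load_weights("label_0_weights")

For now, though, I tried to keep using the same in-memory model, text_model, and the code is as follows.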

import pandas as pd

# text_cleaner, tokenizer, progress, and text_model come from the code above
user = pd.read_csv("Noodlesam.csv")
user = user.dropna()
user['Trimmed text'] = user['Trimmed text'].astype(str).apply(text_cleaner)
input_texts = user['Trimmed text'].tolist()  # renamed from `input` to avoid shadowing the builtin

# the original code reused this function's name for the result list below,
# which overwrote the function; renamed to avoid that
def tokenize_input(data):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(data))

tokenized_input = [tokenize_input(data) for data in input_texts]
print(len(tokenized_input))

new_whole_threshold_predictions = []
new_whole_real_predictions = []
new_self_threshold_predictions = []
new_self_label_real_values = []

for q, item1 in enumerate(tokenized_input):
    if q % 2 == 0:
        progress(q, len(tokenized_input))
    res = text_model.predict([item1])
    new_self_label_real_values.append(res[0][0])

    if res[0][0] > 0.93:
        new_self_threshold_predictions.append(1.0)
    else:
        new_self_threshold_predictions.append(0.0)

new_whole_threshold_predictions.append(new_self_threshold_predictions)
new_whole_real_predictions.append(new_self_label_real_values)

new_whole_threshold_predictions = list(map(list, zip(*new_whole_threshold_predictions)))
new_whole_real_predictions = list(map(list, zip(*new_whole_real_predictions)))

But this gives me output for only one of the 9 labels. I do understand that in the previous code, the training data was fitted to the model inside the for i in range(len(y_train[0])): loop, i.e. 9 times, once per label. From this point on, I don't understand how this model works, and I need to know how I can use it for unseen data.
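If my understanding is right, after the loop text_model only holds the weights of the last (9th) label's model, so predicting unseen data with it can only ever give that one label's score. What I imagine is needed is to keep each label's trained model and loop over all of them, roughly like this (a sketch, assuming the training loop were changed to append every text_model to a per_label_models list):

# sketch: per_label_models would be filled inside the training loop with
# per_label_models.append(text_model), one entry per label
unseen = tokenized_input[0]
scores = [m.predict([unseen])[0][0] for m in per_label_models]
print(scores)  # nine sigmoid probabilities, one per label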

I think this explains my issue; any help would be really appreciated.

Answer

If I understand it correctly, you train the model with OUTPUT_CLASSES = 2, which makes it use a sigmoid output layer through this condition:

if model_output_classes == 2:
    self.last_dense = layers.Dense(units=1,
                                   activation="sigmoid")
else:
    self.last_dense = layers.Dense(units=model_output_classes,
                                   activation="softmax")

This means you get only a single output unit with a probability between 0 and 1. You can fix this by changing that variable to the number of labels you have, so that the softmax output layer is used instead. Softmax will give you a probability distribution over the labels.
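A sketch of the suggested change, reusing the hyperparameters from the question:

NUM_LABELS = 9

text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=NUM_LABELS,  # > 2, so the softmax branch is used
                        dropout_rate=DROPOUT_RATE)

text_model.compile(loss="sparse_categorical_crossentropy",
                   optimizer="adam",
                   metrics=["sparse_categorical_accuracy"])

res = text_model.predict([tokenized_input[0]])
print(res[0])  # nine softmax probabilities that sum to 1

Note that with a softmax over 9 units, each review is assumed to belong to exactly one label.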
