当我尝试将 tf-idf 应用于测试集时维度不匹配 [英] Dimension mismatch when I try to apply tf-idf to test set

查看:52
本文介绍了当我尝试将 tf-idf 应用于测试集时维度不匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将新的预处理算法应用于我的数据集,遵循以下答案:在机器学习分类器中编码文本

I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier

我现在尝试的是以下内容:

What I have tried now is the following:

def test_tfidf(data, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)
    
    return X, list_labels

(我建议参考我上面提到的所有代码的链接).当我尝试将后两个函数应用于我的数据集时:

(I would suggest to refer to the link I mentioned above for all the code). When I try to apply the latter two function to my dataset:

train_x, train_y, count_vectorizer  = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)

full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)

我收到此错误:

---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True) 
---> 14     y_pred = clf.predict(X_test_naive)

ValueError: dimension mismatch

错误中提到的函数是:

def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() 
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
        
    return 

任何有助于理解我的新定义和/或将 tf-idf 应用于我的数据集的错误(请参阅此处了解相关部分:在 ML 分类器中编码文本),不胜感激.

Any help in understanding what is wrong in my new definition and/or in applying the tf-idf to my dataset (please refer here for the relevant parts: Encoding text in ML classifier), it would be appreciated.

更新:我认为这个问题/答案对帮助我找出问题也很有用:scikit-learn ValueError: 维度不匹配

Update: I think this question/answer might be useful as well for helping me in figure out the issue: scikit-learn ValueError: dimension mismatch

如果我将 test_x, test_y = test_tfidf(testing_set, ngrams = 1) 替换为 test_x, test_y = test_tfidf(undersample_train, ngrams = 1) 它不会返回任何错误.但是,我认为这是不对的,因为我得到的值非常高(所有统计数据为 99%)

if I replace test_x, test_y = test_tfidf(testing_set, ngrams = 1) with test_x, test_y = test_tfidf(undersample_train, ngrams = 1) it does not return any error. However, I do not think it is right, as I am getting values very very high (99% on all statistics)

推荐答案

使用变换(在本例中为TfidfVectorizer)时,必须使用相同的对象来变换训练和测试数据.转换器通常仅使用训练数据进行拟合,然后重新用于转换测试数据.

When using transformes (TfidfVectorizer in this case), you must use the same object ot transform both train and test data. The transformer is typically fitted using the training data only, and then re-used to transform the test data.

在您的情况下执行此操作的正确方法:

The correct way to do this in your case:

def tfidf(data, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)
    
    return X, list_labels, tfidf_vectorizer


def test_tfidf(data, vectorizer, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    # No need to create a new TfidfVectorizer here!

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = vectorizer.transform(list_corpus)
    
    return X, list_labels

# this method is copied from the other SO question
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() # Gaussian Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    
    y_pred = clf.predict(X_test_naive)
    
    f1 = f1_score(y_pred, y_test_naive, average = 'weighted')
    pres = precision_score(y_pred, y_test_naive, average = 'weighted')
    rec = recall_score(y_pred, y_test_naive, average = 'weighted')
    acc = accuracy_score(y_pred, y_test_naive)
    
    res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres, 
                     'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)

    return res 

train_x, train_y, count_vectorizer  = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)

full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, count_vectorizer), ignore_index = True)

这篇关于当我尝试将 tf-idf 应用于测试集时维度不匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆