Query data dimension must match training data dimension


Problem description

I'm developing a tweet classifier. I trained a KNN classifier on a TF-IDF dataset in which each row has a length of 3,173. After training the model, I saved it to a file so that I can classify new tweets.

The problem is that every time I extract new tweets and try to classify them, the TF-IDF vector length varies depending on the vocabulary of the newly extracted tweets, so the model cannot classify them.

I've been searching and trying to solve this for two days but have not found an efficient solution. How can I efficiently adapt the dimension of the query data to the dimension of the training data?

This is my code:

    # Classify the TASS test tweets
    from collections import Counter

    import joblib
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    clf = joblib.load('files/model_knn_pos.sav')

    # Load the tweets
    dfNew = pd.read_csv('files/tweetsTASStestCaract.csv', encoding='UTF-8', sep='|')

    # Preprocess (Preprocesado is the asker's own class, not shown here)
    prepro = Preprocesado()
    dfNew['clean_text'] = prepro.procesa(dfNew['tweet'])

    # Excluded middle: collapse every non-positive label into 'NoPos'
    dfNew['type'] = dfNew['type'].replace(['NEU', 'N', 'NONE'], 'NoPos')

    # Helper function used to build the vectors
    def tokenize(s):
        return s.split()

    # Build one vector per tweet, keeping only terms that appear in at least 3 tweets
    vect = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 2), max_df=0.75, min_df=3, sublinear_tf=True)
    muestra = vect.fit_transform(dfNew['clean_text']).toarray().tolist()

    # Append the hand-crafted features to each tweet vector
    caract = dfNew.drop(columns=['tweet', 'clean_text', 'type']).values
    for i in range(len(muestra)):
        muestra[i].extend(caract[i])

    # Classify as positive / not positive
    y_train = dfNew['type'].values
    resultsPos = clf.predict(muestra)
    print(Counter(resultsPos))

This is the error I get:

    File "sklearn/neighbors/binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query
    ValueError: query data dimension must match training data dimension

Recommended answer

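The solution is simple:

Call vect.fit_transform() on the training data only. For the test data, call vect.transform() on that same fitted vectorizer, so that the new tweets are mapped onto the training vocabulary and end up with the same number of columns. In the code above, this means saving the fitted vectorizer along with the model and loading it back, instead of creating and fitting a new TfidfVectorizer.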
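As a rough illustration of the advice above, here is a minimal sketch of the fit-once, transform-later pattern. The toy texts, the labels, the `n_neighbors=1` setting, the temporary file name and the dictionary used to bundle the two objects are all assumptions made for the example; they are not taken from the question.

```python
from collections import Counter
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for the real tweets (hypothetical examples).
train_texts = [
    "good movie great plot",
    "great acting good fun",
    "bad movie boring plot",
    "boring acting bad script",
]
train_labels = ["Pos", "Pos", "NoPos", "NoPos"]
new_texts = ["great fun movie tonight", "bad boring sequel"]

# Training time: fit the vocabulary ONCE, on the training tweets.
vect = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_train = vect.fit_transform(train_texts)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)

# Persist the fitted vectorizer together with the model
# (bundling them in a dict is just one possible layout).
model_path = os.path.join(tempfile.mkdtemp(), "model_and_vect.joblib")
joblib.dump({"clf": clf, "vect": vect}, model_path)

# Prediction time: reload both and only transform the new tweets.
saved = joblib.load(model_path)
X_new = saved["vect"].transform(new_texts)
predictions = saved["clf"].predict(X_new)

# For comparison: refitting on the new tweets builds a different vocabulary,
# which is what triggers the dimension mismatch.
X_refit = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True).fit_transform(new_texts)

print("training columns:", X_train.shape[1])
print("transform() columns:", X_new.shape[1])
print("refit columns:", X_refit.shape[1])
print(Counter(predictions))
```

Words in the new tweets that were never seen during training are simply ignored by transform(). If extra hand-crafted features are appended to each vector, as in the question, they would presumably need to be appended after transform() in the same order as at training time.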
