Query data dimension must match training data dimension

Views: 130 | Published: 2021/5/28 19:34:56 | Tags: python scikit-learn nlp knn tweets


Problem description

I'm developing a tweet classifier. I trained a KNN classifier on a TF-IDF dataset in which each row has a length of 3,173. After training the model, I save it to a file so that I can classify new tweets.

The problem is that every time I extract new tweets and try to classify them, the length of the TF-IDF vectors varies depending on the vocabulary of the newly extracted tweets, so the model cannot classify them.
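To illustrate the point about vocabulary, here is a minimal sketch with made-up tweets (the corpora and variable names are assumptions, not the asker's data). A `TfidfVectorizer` fitted on a different corpus produces one column per term it saw, so the width changes with the text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the training tweets and the new tweets.
train_tweets = ["good day today", "bad day today", "good game"]
new_tweets = ["great game", "awful"]

# Fitting separately on each corpus builds a separate vocabulary,
# so the resulting matrices have different numbers of columns.
train_width = TfidfVectorizer().fit_transform(train_tweets).shape[1]
new_width = TfidfVectorizer().fit_transform(new_tweets).shape[1]

print(train_width, new_width)
```

The two widths differ because each call to `fit_transform` learns its own vocabulary.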

I've been searching and trying to solve this for two days but have not found an efficient solution. How can I efficiently adapt the dimension of the query data to the dimension of the training data?

This is my code:

    from collections import Counter

    import joblib
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Preprocesado is the author's own preprocessing class; its import is not shown.

    # CLASSIFY THE TASS TEST TWEETS
    clf = joblib.load('files/model_knn_pos.sav')

    # Load the tweets
    dfNew = pd.read_csv('files/tweetsTASStestCaract.csv', encoding='UTF-8', sep='|')

    # Preprocess
    prepro = Preprocesado()
    dfNew['clean_text'] = prepro.procesa(dfNew['tweet'])

    # Excluded middle: collapse every other label into 'NoPos'
    dfNew['type'].replace(['NEU', 'N', 'NONE'], 'NoPos', inplace=True)

    # Helper function used to build the vectors
    def tokenize(s):
        return s.split()

    # Build one vector per tweet, keeping only terms that appear in at least 3 tweets (min_df=3)
    vect = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 2), max_df=0.75, min_df=3, sublinear_tf=True)
    muestra = vect.fit_transform(dfNew['clean_text']).toarray().tolist()

    # Append the extra feature columns to each tweet vector
    for i in range(len(muestra)):
        caract = dfNew.drop(columns=['tweet', 'clean_text', 'type']).values[i]
        muestra[i].extend(caract)

    # Classify positive vs. non-positive
    y_train = dfNew['type'].values
    resultsPos = clf.predict(muestra)
    print(Counter(resultsPos))

This is the error I get:

    File "sklearn/neighbors/binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query
    ValueError: query data dimension must match training data dimension
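This error is consistent with a query matrix whose column count differs from the one the classifier was trained on. One common way to avoid it, sketched here under assumptions, is to fit the vectorizer once on the training tweets, save it alongside the model, and call `transform()` (not `fit_transform()`) on new tweets. The toy data, the single extra feature column, and the file name `knn_with_vectorizer.joblib` are all hypothetical and only illustrate the pattern:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data; the real tweets, labels and extra feature columns differ.
train_texts = ["good day", "great game", "bad day", "awful game"]
train_extra = np.array([[1.0], [2.0], [0.0], [1.0]])  # stand-in for the extra features
train_labels = ["P", "P", "NoPos", "NoPos"]

# Training time: fit the vectorizer once and keep it with the classifier.
vect = TfidfVectorizer()
X_train = np.hstack([vect.fit_transform(train_texts).toarray(), train_extra])
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)

# Hypothetical file name, written to a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), "knn_with_vectorizer.joblib")
joblib.dump({"vect": vect, "clf": clf}, path)

# Prediction time: reload both objects and reuse the training vocabulary.
saved = joblib.load(path)
new_texts = ["good game tonight", "bad weather"]
new_extra = np.array([[1.0], [0.0]])
X_new = np.hstack([saved["vect"].transform(new_texts).toarray(), new_extra])

# Terms unseen during training are ignored, so the width matches X_train.
print(X_new.shape, X_train.shape)
predictions = saved["clf"].predict(X_new)
print(predictions)
```

Because `transform()` maps new text onto the vocabulary learned during training, words that never appeared in the training tweets are simply dropped, and the query matrix keeps the same number of columns as the training matrix.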
