How to add the following feature to a tfidf matrix?

Views: 156  Posted: 2020/5/18 21:15:57  Tags: numpy, scikit-learn


Question

Hello, I have a list called list_cluster that looks as follows:

list_cluster=["hello,this","this is a test","the car is red",...]

I am using a previously fitted TfidfVectorizer to transform it as follows:

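import pickle
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# load the previously fitted vectorizer from disk
with open('vectorizerTFIDF.pickle', 'rb') as infile:
    tdf = pickle.load(infile)
# transform returns a scipy sparse matrix, one row per document
tfidf2 = tdf.transform(list_cluster)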

Then I would like to add new features to this matrix, tfidf2. I have a list as follows:

dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]

This list has the same length as list_cluster. Each entry represents a date using 12 positions, and the position of the 1 marks the corresponding month of the year.

For instance, '010000000000' represents February.

To use it as a feature, I first tried:

import numpy as np
# listMonth is presumably the list shown above as dates
dates=np.array(listMonth)
dates=np.transpose(dates)

to get a numpy array and then transpose it, so that I can concatenate it with the first matrix, tfidf2.

print("shape tfidf2: "+str(tfidf2.shape),"shape dates: "+str(dates.shape))

To concatenate my vector and matrix, I tried:

tfidf2=np.hstack((tfidf2,dates[:,None]))

But this is the output:

shape tfidf2: (11159, 1927) shape dates: (11159,)
Traceback (most recent call last):
  File "Main.py", line 230, in <module>
    tfidf2=np.hstack((tfidf2,dates[:,None]))
  File "/usr/local/lib/python3.5/dist-packages/numpy/core/shape_base.py", line 278, in hstack
    return _nx.concatenate(arrs, 0)
ValueError: all the input arrays must have same number of dimensions

The shapes look right, but I am not sure what is failing. I would appreciate help concatenating this feature to my tfidf2 matrix. Thanks in advance for your attention.

Recommended answer

You need to convert all strings to numeric values for sklearn. One way to do this is to use the LabelBinarizer class in sklearn's preprocessing module. It creates a new binary column for each unique value in your original column.
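As a small illustration of what LabelBinarizer does with month strings like the ones in the question, here is a minimal sketch on four toy values (the sample values are assumptions chosen for the demo, not the asker's real data). Note that the columns follow the sorted order of the unique strings, not calendar order, and months that never occur get no column.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy month strings in the same 12-position format as the question.
dates = ['010000000000', '001000000000', '001000000000', '000000000001']

lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)  # dense integer array, one column per unique string

print(lb.classes_)    # unique strings in sorted order; this is the column order
print(b_dates.shape)  # (number of rows, number of unique strings)
print(b_dates)
```

lb.classes_ can be used afterwards to map each output column back to the month string it stands for.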

If dates has the same number of rows as tfidf2, then I think this will work.

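import scipy.sparse as sp
from sklearn.preprocessing import LabelBinarizer

# create tfidf2 (a scipy sparse matrix)
tfidf2 = tdf.transform(list_cluster)

# create dates
dates=['010000000000', '001000000000', '001000000000', '000000000001', '001000000000', '000000000010',...]

# binarize dates: one binary column per unique month string
lb = LabelBinarizer()
b_dates = lb.fit_transform(dates)

# tfidf2 is sparse, so np.concatenate / np.hstack fail on it;
# scipy.sparse.hstack stacks the two blocks column-wise instead
new_tfidf = sp.hstack((tfidf2, sp.csr_matrix(b_dates))).tocsr()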
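The traceback in the question most likely arises because TfidfVectorizer returns a scipy sparse matrix, which NumPy's stacking functions do not treat as a 2-D array. The following self-contained sketch shows one way to append the binarized date columns using scipy.sparse.hstack. The toy corpus and dates are assumptions, and the vectorizer is fitted in place rather than loaded from the asker's pickle file.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

# Toy stand-ins for the question's data (hypothetical values).
list_cluster = ["hello,this", "this is a test", "the car is red"]
dates = ['010000000000', '001000000000', '000000000001']

# Fit a fresh vectorizer here; the question loads a pickled one instead.
tdf = TfidfVectorizer()
tfidf2 = tdf.fit_transform(list_cluster)  # scipy sparse matrix

# One dense binary column per unique month string.
b_dates = LabelBinarizer().fit_transform(dates)

# Convert the dense block to sparse and stack column-wise.
new_tfidf = sp.hstack((tfidf2, sp.csr_matrix(b_dates))).tocsr()

print(tfidf2.shape, b_dates.shape, new_tfidf.shape)
```

Keeping the result sparse avoids materializing the full tf-idf matrix in memory, which matters at the question's size of 11159 rows by 1927 columns. Calling tfidf2.toarray() and then np.hstack would also work, at the cost of a dense array.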
