Train a `sklearn` ML model with a scipy sparse matrix and a numpy array


Question

To explain a bit more about my use case: A is a sparse matrix with tf-idf values, and B is an array with some additional features of my data.

I have already split the data into training and test sets, so A and B in my example cover only the training set. I want to do the same for the test set after this code.

I want to concatenate these matrices/arrays so that I can pass them to a sklearn ML model for training, since I do not think I can pass them separately.

So I tried this:

C = np.concatenate((A, B.T), axis=1)

where A is a <class 'scipy.sparse.csr.csr_matrix'> and B is a <class 'numpy.ndarray'>.

However, when I try this I get the following error:

ValueError: zero-dimensional arrays cannot be concatenated

Also, I do not think that using `np.concatenate` on a numpy array and a sparse matrix is a good idea in my case, because

  1. it is basically impossible to convert my sparse array A to a dense array because it is too big
  2. I will lose (or perhaps not?) information if I convert my fully dense array B to a sparse array

What is the best way to pass a sparse array and a fully dense array, concatenated by rows, to an sklearn ML model?

Recommended answer

  1. You can use hstack from scipy. hstack will convert both matrices to scipy coo_matrix, merge them, and return a coo_matrix by default.

No information is lost when converting a dense array to sparse. Sparse matrices are just a compact data storage format. Also, unless you specify a value for the dtype argument of hstack, everything is upcast to a common dtype, so there is no possibility of data loss there either.
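The two claims above can be checked on a small example. This is only a sketch: the toy values and the dtypes chosen for A and B are assumptions for illustration, not the asker's actual data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy data standing in for A (sparse tf-idf) and B (dense features).
A = csr_matrix(np.array([[0.0, 0.5], [0.3, 0.0]], dtype=np.float32))
B = np.array([[1, 2], [3, 4]], dtype=np.int64)

# Dense -> sparse -> dense round trip: the values of B come back unchanged.
B_sparse = csr_matrix(B)
round_trip_ok = np.array_equal(B_sparse.toarray(), B)

# With no format and no dtype argument, hstack returns COO format
# and upcasts both inputs to a common dtype.
X = hstack((A, B))
b_columns_ok = np.array_equal(X.toarray()[:, 2:], B)

print(round_trip_ok, b_columns_ok, X.format, X.dtype, X.shape)
```

Passing an explicit dtype narrower than the inputs would be the one place where values could be truncated, which is why leaving dtype unset is the safe default here.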

Further, if you plan to use Logistic Regression from sklearn, sparse matrices must be in csr format for the fit method to work.

The following code should work for your use case:

from scipy.sparse import hstack

X = hstack((A, B), format='csr')
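For a fuller picture, here is a minimal end-to-end sketch of the same idea, applied to both the training and the test set. The random matrices, the shapes, the labels, and the variable names (A_train, B_train, and so on) are all made up for illustration. It assumes B has shape (n_samples, n_features); if B is stored the other way round, as the B.T in the question suggests, it needs to be transposed first so that its row count matches A.

```python
import numpy as np
from scipy.sparse import hstack, random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test = 40, 10

# Hypothetical stand-ins for the tf-idf matrices (sparse, CSR).
A_train = sparse_random(n_train, 50, density=0.1, format="csr", random_state=0)
A_test = sparse_random(n_test, 50, density=0.1, format="csr", random_state=1)

# Hypothetical dense extra features, one row per sample.
B_train = rng.normal(size=(n_train, 3))
B_test = rng.normal(size=(n_test, 3))

# Arbitrary alternating labels, only so that both classes are present.
y_train = np.arange(n_train) % 2

# Stack sparse and dense features side by side; the result stays sparse.
X_train = hstack((A_train, B_train), format="csr")
X_test = hstack((A_test, B_test), format="csr")

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(X_train.shape, X_train.format, pred.shape)
```

The test set has to be stacked in the same column order as the training set, so that each column means the same feature in both.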
