How to compare 2 sparse matrices stored using scikit-learn's load_svmlight_file?

Problem description

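I am trying to compare the feature vectors present in the test and training data sets. These feature vectors are stored in sparse format using scikit-learn's load_svmlight_file. The feature vectors of both data sets have the same dimension. However, I am getting this error: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()."

Why am I getting this error? How can I resolve it?

Thanks in advance!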

from sklearn.datasets import load_svmlight_file
pathToTrainData="../train.txt"
pathToTestData="../test.txt"
X_train,Y_train= load_svmlight_file(pathToTrainData);
X_test,Y_test= load_svmlight_file(pathToTestData);

for ele1 in X_train:
    for ele2 in X_test:
        if(ele1==ele2):
           print "same vector"


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-c1f145f984a6> in <module>()
      7 for ele1 in X_train:
      8     for ele2 in X_test:
----> 9         if(ele1==ele2):
     10            print "same vector"

/Users/rkasat/anaconda/lib/python2.7/site-packages/scipy/sparse/base.pyc in __bool__(self)
    181             return True if self.nnz == 1 else False
    182         else:
--> 183             raise ValueError("The truth value of an array with more than one "
    184                              "element is ambiguous. Use a.any() or a.all().")
    185     __nonzero__ = __bool__

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Solution

You can use this condition to check whether two sparse arrays are exactly equal without densifying them:

if (ele1 - ele2).nnz == 0:
    # Matched, do something ...

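The nnz attribute gives the number of nonzero elements stored in the sparse array, so a difference with nnz == 0 means the two rows hold identical values.

Some simple test runs show the difference in speed: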
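The error arises because `ele1 == ele2` on sparse rows produces another sparse matrix of element-wise results rather than a single boolean, and `if` then asks that matrix for a truth value. The following is a minimal sketch of the nnz check, assuming Python 3 and small hand-built matrices standing in for the files from the question. The helper `rows_equal` and the sample data are illustrative names and values, not part of the original code.

```python
import numpy as np
from scipy import sparse


def rows_equal(a, b):
    """Return True when two sparse matrices have the same shape and values."""
    if a.shape != b.shape:
        return False
    # Subtraction keeps only nonzero differences, so nnz == 0 means equality.
    return (a - b).nnz == 0


# Illustrative stand-ins for the matrices returned by load_svmlight_file.
X_train = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0]]))
X_test = sparse.csr_matrix(np.array([[2.0, 0.0, 3.0], [0.0, 0.0, 5.0]]))

matches = []
for i in range(X_train.shape[0]):
    for j in range(X_test.shape[0]):
        # X[i] on a CSR matrix is a 1 x n sparse row, so nothing is densified.
        if rows_equal(X_train[i], X_test[j]):
            matches.append((i, j))

print(matches)  # [(1, 0)]
```

Indexing by row number instead of iterating over the matrix directly makes it easy to record which train/test pair matched.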

import numpy as np
from scipy import sparse

A = sparse.rand(10, 1000000).tocsr()

def benchmark1(A):
    for s1 in A:
        for s2 in A:
            if (s1 - s2).nnz == 0:
                pass

def benchmark2(A):
    for s1 in A:
        for s2 in A:
            if (s1.toarray() == s2).all():
                pass

%timeit benchmark1(A)
%timeit benchmark2(A)

Some results:

# Computer 1
10 loops, best of 3: 36.9 ms per loop # with nnz
1 loops, best of 3: 734 ms per loop # with toarray

# Computer 2
10 loops, best of 3: 28 ms per loop
1 loops, best of 3: 312 ms per loop
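The `%timeit` lines above only work inside IPython. As a rough sketch of the same comparison in a plain Python 3 script, the standard `timeit` module can be used instead. The matrix size, density, repeat count, and the function names `compare_with_nnz` and `compare_with_toarray` are assumptions chosen to keep the run short, not the settings behind the numbers above.

```python
import timeit

from scipy import sparse

# A smaller random CSR matrix than in the benchmark above, to keep the run quick.
A = sparse.random(5, 100000, density=0.01, format="csr")


def compare_with_nnz(A):
    """Count equal row pairs using the sparse difference."""
    count = 0
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            if (A[i] - A[j]).nnz == 0:
                count += 1
    return count


def compare_with_toarray(A):
    """Count equal row pairs after converting both rows to dense arrays."""
    count = 0
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            if (A[i].toarray() == A[j].toarray()).all():
                count += 1
    return count


t_nnz = timeit.timeit(lambda: compare_with_nnz(A), number=3)
t_dense = timeit.timeit(lambda: compare_with_toarray(A), number=3)
print("nnz:     %.4f s for 3 runs" % t_nnz)
print("toarray: %.4f s for 3 runs" % t_dense)
```

Absolute timings depend on the machine and on how sparse the data is, so the printed numbers are only meaningful relative to each other.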
