Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

Views: 11942. Posted: 2016/6/1 20:08:00. Tags: python, arrays, numpy, pandas, scikit-learn

Problem description

I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.

Unfortunately, I get the error in the title when I try to run hstack to add this single column to my other numpy array.

Here is my code:

  #reading in test/train data for TF-IDF
  traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
  testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

  #reading in labels for training
  y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

  #reading in single integer column to join
  AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
  AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
  AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #Classifier
  X_all = traindata + testdata #adding test and train data to put into tf-idf
  lentrain = len(traindata) #find length of train data
  tfv.fit(X_all) #fit tf-idf on all our text
  X_all = tfv.transform(X_all) #transform it
  X = X_all[:lentrain] #reduce to size of training set
  AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
  X_test = X_all[lentrain:] #remaining rows are the test set

  #printing debug info, output below : 
  print "X.shape => " + str(X.shape)
  print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
  print "X_all.shape => " + str(X_all.shape)

  #line we get error on
  X = np.hstack((X, AllAlexaAndGoogleInfo))

Below is the output and error message:

X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
     31 print "X_all.shape => " + str(X_all.shape)
     32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
     34 sc = preprocessing.StandardScaler().fit(X)
     35 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
    271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    272     if arrs[0].ndim == 1:
--> 273         return _nx.concatenate(arrs, 0)
    274     else:
    275         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?

Thank you.

Edit:

Using np.column_stack, as suggested in an answer below, produces the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
     37 sc = preprocessing.StandardScaler().fit(X)
     38 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
    294             arr = array(arr,copy=False,subok=True,ndmin=2).T
    295         arrays.append(arr)
--> 296     return _nx.concatenate(arrays,1)
    297 
    298 def dstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, printing the dtype of X works fine:

X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo like so:

print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)

produces:

'DataFrame' object has no attribute 'dtype'

Solution

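X is a sparse matrix (it comes from the TF-IDF transform), so use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is rather misleading here: NumPy does not recognize SciPy sparse matrices and treats X as a single opaque object rather than as a 2-D array, so it complains about the number of dimensions even though the printed shapes look compatible.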

This minimal example illustrates the situation:

import numpy as np
from scipy import sparse

X = sparse.rand(10, 10000)      # random sparse matrix standing in for the TF-IDF output
xt = np.random.random((10, 1))  # dense column to append
print('X shape:', X.shape)
print('xt shape:', xt.shape)
print('Stacked shape:', np.hstack((X, xt)).shape)  # raises ValueError
# print('Stacked shape:', sparse.hstack((X, xt)).shape)  # this works

Based on the following output

X shape: (10, 10000)
xt shape: (10, 1)

one may expect that the hstack in the following line will work, but the fact is that it throws this error:

ValueError: all the input arrays must have same number of dimensions

So use scipy.sparse.hstack whenever one of the arrays being stacked is sparse.
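As a minimal sketch of the working call (the names `X`, `xt`, and `stacked` and the shapes are illustrative only, and a random matrix stands in for the TF-IDF output), the dense column can be stacked onto the sparse matrix like this:

```python
import numpy as np
from scipy import sparse

# Illustrative stand-in for the TF-IDF matrix (assumption: any 2-D sparse
# matrix with the same number of rows behaves the same way).
X = sparse.random(10, 1000, density=0.01, format="csr")
xt = np.random.random((10, 1))  # dense column to append

# Convert the dense column to sparse explicitly, then stack horizontally.
# format="csr" returns a CSR matrix, which supports row slicing.
stacked = sparse.hstack((X, sparse.csr_matrix(xt)), format="csr")

print("Stacked shape:", stacked.shape)
print("Still sparse:", sparse.issparse(stacked))
```

The result stays sparse, which matters at the sizes in the question: a dense float64 array of shape (7395, 238377) would need roughly 14 GB of memory.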

In fact, I already answered this in a comment on another of your questions, where you mentioned that a different error message then appears:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

First of all, AllAlexaAndGoogleInfo does not have a dtype attribute because it is a DataFrame. To get its underlying numpy array, use AllAlexaAndGoogleInfo.values, then check the dtype of that array. Based on the error message, the dtype is object, which means the column probably contains non-numerical elements such as strings.
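The check described above might look like the following sketch. The small DataFrame here is made up for illustration (the real data comes from the CSV files in the question), and the `"n/a"` entry is an assumed example of a non-numerical value:

```python
import pandas as pd

# Hypothetical stand-in for AllAlexaAndGoogleInfo: one column that mixes
# integers with a string, which forces an object dtype.
df = pd.DataFrame({"alexarank": [12, "n/a", 7]})

# A DataFrame exposes per-column dtypes via .dtypes; it has no .dtype attribute.
print(hasattr(df, "dtype"))
print(df.dtypes)

# The underlying numpy array does have a single dtype.
values = df.values
print(values.shape, values.dtype)
```

If the printed dtype is `object` rather than an integer or float type, at least one entry in the column is not a plain number.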

This is a minimal example that reproduces this situation:

import numpy as np
from scipy import sparse

X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object')  # comment out this line to fix the error
print('X:', X.shape, X.dtype)
print('xt:', xt.shape, xt.dtype)
print('Stacked shape:', sparse.hstack((X, xt)).shape)  # raises TypeError

The error message:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So check whether AllAlexaAndGoogleInfo contains any non-numerical values and repair them before stacking.
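One possible way to do that repair is sketched below. The sample data is invented, and replacing unparseable entries with 0 is only an assumption made for the example; the right replacement value depends on what a missing rank should mean for your classifier:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative stand-ins: a small sparse matrix for the TF-IDF features and
# a DataFrame column containing two non-numerical entries.
X = sparse.random(4, 50, density=0.1, format="csr")
info = pd.DataFrame({"alexarank": [12, "n/a", 7, None]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(info["alexarank"], errors="coerce")

# Rows that failed to parse, useful for inspecting what went wrong.
bad_rows = info[numeric.isna()]
print(bad_rows)

# Assumption: fill the unparseable entries with 0, then build a float column.
column = numeric.fillna(0).to_numpy(dtype=np.float64).reshape(-1, 1)

# With a numeric dtype the sparse stack succeeds.
X_stacked = sparse.hstack((X, sparse.csr_matrix(column)), format="csr")
print(X_stacked.shape, X_stacked.dtype)
```

Inspecting `bad_rows` first is worthwhile, because a systematic problem (for example a thousands separator or a stray header row) is better fixed when reading the CSV than patched afterwards.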
