Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do
Problem description
I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.
Unfortunately, I am getting the error in the title when I try and run hstack to add this single column to my other numpy array.
Here is my code:
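#imports are not shown in the original post; these are assumed from the aliases used below
import numpy as np
import pandas as p
from sklearn import linear_model as lm
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
#reading in test/train data for TF-IDF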
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])
#reading in labels for training
y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]
#reading in single integer column to join
AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
C=1, fit_intercept=True, intercept_scaling=1.0,
class_weight=None, random_state=None) #Classifier
X_all = traindata + testdata #adding test and train data to put into tf-idf
lentrain = len(traindata) #find length of train data
tfv.fit(X_all) #fit tf-idf on all our text
X_all = tfv.transform(X_all) #transform it
X = X_all[:lentrain] #reduce to size of training set
AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
X_test = X_all[lentrain:] #reduce to size of training set
#printing debug info, output below :
print "X.shape => " + str(X.shape)
print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
print "X_all.shape => " + str(X_all.shape)
#line we get error on
X = np.hstack((X, AllAlexaAndGoogleInfo))
Below is the output and error message :
X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
31 print "X_all.shape => " + str(X_all.shape)
32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
34 sc = preprocessing.StandardScaler().fit(X)
35 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
271 # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
272 if arrs[0].ndim == 1:
--> 273 return _nx.concatenate(arrs, 0)
274 else:
275 return _nx.concatenate(arrs, 1)
ValueError: all the input arrays must have same number of dimensions
What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?
Thank you.
Edit :
Using the method in the answer below gets the following error :
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
37 sc = preprocessing.StandardScaler().fit(X)
38 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
294 arr = array(arr,copy=False,subok=True,ndmin=2).T
295 arrays.append(arr)
--> 296 return _nx.concatenate(arrays,1)
297
298 def dstack(tup):
ValueError: all the input array dimensions except for the concatenation axis must match exactly
Interestingly, I tried to print the dtype of X and this worked fine:
X.dtype => float64
However, trying to print the dtype of AllAlexaAndGoogleInfo like so:
print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)
produces:
'DataFrame' object has no attribute 'dtype'
As X is a sparse array, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is kind of misleading here.
This minimal example illustrates the situation:
import numpy as np
from scipy import sparse
X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works
Based on the following output
X shape: (10, 10000)
xt shape: (10, 1)
one might expect the np.hstack call on the next line of the example to work, but in fact it throws this error:
ValueError: all the input arrays must have same number of dimensions
So, use scipy.sparse.hstack whenever one of the arrays you are stacking is sparse.
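To spell out what goes wrong, here is a small Python 3 sketch (the examples above use Python 2 print statements). It uses a random sparse matrix as a stand-in for the TF-IDF output, and the names X, xt and stacked are illustrative only. NumPy has no notion of SciPy's sparse containers, so np.asarray typically wraps the whole matrix in a single object element rather than producing a 2-D array, which would explain why np.hstack complains about the number of dimensions even though both shapes look compatible.

```python
import numpy as np
from scipy import sparse

# Stand-ins for the TF-IDF matrix and the extra feature column (assumed data).
X = sparse.rand(10, 10000, density=0.01, format="csr", random_state=0)
xt = np.random.random((10, 1))

# NumPy does not unpack the sparse container; inspect what it sees instead.
seen_by_numpy = np.asarray(X)
print("np.asarray(X):", seen_by_numpy.dtype, seen_by_numpy.shape)

# scipy.sparse.hstack accepts a mix of sparse and dense 2-D blocks.
stacked = sparse.hstack((X, xt), format="csr")
print("Stacked shape:", stacked.shape)
```

Passing format="csr" is optional; it is used here because row slicing (as in X_all[:lentrain]) needs a row-indexable format, which the default COO result is not.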
In fact, I already answered this in a comment on another of your questions, where you mentioned that a different error message pops up:
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
First of all, AllAlexaAndGoogleInfo does not have a dtype, as it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values and check its dtype. Based on the error message, it has a dtype of object, which means that it might contain non-numerical elements such as strings.
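As a rough illustration of that check, the sketch below builds a made-up DataFrame. The values, including the string "unknown", are assumptions and not data from the question. It prints the dtypes and then uses pd.to_numeric to locate the entries that cannot be read as numbers.

```python
import pandas as pd

# Hypothetical stand-in for the "alexarank" column; the string entry is an
# assumed example of a value that forces the object dtype.
info = pd.DataFrame({"alexarank": [1200, "unknown", 87, 45000]})

print(info.dtypes)        # per-column dtypes of the DataFrame
print(info.values.dtype)  # dtype of the underlying NumPy array

# Coerce to numbers; anything unparseable becomes NaN, which marks the bad rows.
as_numbers = pd.to_numeric(info["alexarank"], errors="coerce")
bad_rows = info[as_numbers.isna()]
print(bad_rows)
```

A DataFrame exposes dtypes (plural, one per column) rather than dtype, which is why the attribute lookup in the question fails.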
This is a minimal example that reproduces this situation:
X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape
The error message:
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
So, check whether there are any non-numerical values in AllAlexaAndGoogleInfo and repair them before doing the stacking.
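One possible way to do the repair, sketched in Python 3 under stated assumptions: the DataFrame contents are invented, and filling unparseable ranks with the median is only one policy among several (dropping the rows is another). The point is that the column should be a float array of shape (n_rows, 1) before it reaches scipy.sparse.hstack.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed stand-ins for the TF-IDF matrix and the rank column.
X = sparse.rand(4, 50, density=0.1, format="csr", random_state=0)
info = pd.DataFrame({"alexarank": [1200, "unknown", 87, 45000]})

# Convert to numeric; unparseable entries become NaN.
rank = pd.to_numeric(info["alexarank"], errors="coerce")
# Assumed policy: replace missing ranks with the median of the valid ones.
rank = rank.fillna(rank.median())

# Shape (n_rows, 1) and float64, so it lines up with X row by row.
rank_col = rank.to_numpy(dtype=np.float64).reshape(-1, 1)

X_combined = sparse.hstack((X, rank_col), format="csr")
print(X_combined.shape, X_combined.dtype)
```

The rows of the extra column also have to be in the same order as the rows of X. In the question the text list is built as train followed by test, while the DataFrame appears to be appended as test followed by train, so slicing both with [:lentrain] would probably not select matching rows.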