ValueError: array length does not match index length


Problem description

I am practicing for competitions such as those on Kaggle. I have been trying out XGBoost and getting familiar with third-party Python libraries such as pandas and NumPy.

I have been reviewing scripts from one particular competition, the Santander Customer Satisfaction classification challenge, and modifying various forked scripts in order to experiment with them.

Here is one modified script in which I am trying to use XGBoost:

import pandas as pd
from sklearn import cross_validation as cv
import xgboost as xgb

df_train = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/train.csv")
df_test  = pd.read_csv("/Users/pavan7vasan/Desktop/Machine_Learning/Project Datasets/Santander_Customer_Satisfaction/test.csv")   

df_train = df_train.replace(-999999,2)

id_test = df_test['ID']
y_train = df_train['TARGET'].values
X_train = df_train.drop(['ID','TARGET'], axis=1).values
X_test = df_test.drop(['ID'], axis=1).values

X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)

clf = xgb.XGBClassifier(objective='binary:logistic',
                missing=9999999999,
                max_depth = 7,
                n_estimators=200,
                learning_rate=0.1, 
                nthread=4,
                subsample=1.0,
                colsample_bytree=0.5,
                min_child_weight = 3,
                reg_alpha=0.01,
                seed=7)

clf.fit(X_train, y_train, early_stopping_rounds=50, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)])
y_pred = clf.predict_proba(X_test)

print("Cross validating and checking the score...")
scores = cv.cross_val_score(clf, X_train, y_train) 
'''
test = []
result = []
for each in id_test:
    test.append(each)
for each in y_pred[:,1]:
    result.append(each)

print len(test)
print len(result)
'''
submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
#submission = pd.DataFrame({"ID":test, "TARGET":result})
submission.to_csv("submission_XGB_Pavan.csv", index=False)

Here is the stack trace:

Traceback (most recent call last):
  File "/Users/pavan7vasan/Documents/workspace/Machine_Learning_Project/Kaggle/XG_Boost.py", line 45, in <module>
    submission = pd.DataFrame({"ID":id_test, "TARGET":y_pred[:,1]})
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 214, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 341, in _init_dict
    dtype=dtype)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4798, in _arrays_to_mgr
    index = extract_index(arrays)
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 4856, in extract_index
    raise ValueError(msg)
ValueError: array length 30408 does not match index length 75818

I have tried fixes based on what I found while searching, but I cannot figure out what the mistake is. What have I done wrong? Please let me know.

Recommended answer

The problem is that you define X_test twice, as @maxymoo mentioned. First you define it as:

X_test = df_test.drop(['ID'], axis=1).values

Then you redefine it with:

X_train, X_test, y_train, y_test = cv.train_test_split(X_train, y_train, random_state=1301, test_size=0.4)

This means X_test now has a size of 0.4*len(X_train). Then, after:

y_pred = clf.predict_proba(X_test)

you have predictions for that held-out part of X_train, and you are trying to build a DataFrame from them together with the initial id_test, which has the length of the original X_test. That is the mismatch in the error: 30408 predictions against 75818 IDs.
You could use X_fit and X_eval in train_test_split instead, so that the initial X_train and X_test are not overwritten. The overwriting affects your cross-validation as well: cross_val_score now runs on a different, reduced X_train, so you will not get the right answer, or your CV score will not agree with the public/private leaderboard score.
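The renaming suggested above can be sketched as follows. This is a minimal illustration and not the asker's actual pipeline: the data is synthetic with made-up sizes, LogisticRegression stands in for xgb.XGBClassifier so the example runs without XGBoost, and train_test_split is imported from sklearn.model_selection (its location in current scikit-learn) instead of the older cross_validation module used in the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for train.csv and test.csv (hypothetical sizes).
n_train, n_test, n_features = 500, 300, 4
X_train = rng.normal(size=(n_train, n_features))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=n_train) > 0).astype(int)
X_test = rng.normal(size=(n_test, n_features))
id_test = pd.Series(np.arange(1, n_test + 1), name="ID")

# Split into NEW names so the original X_train and X_test stay intact.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_train, y_train, random_state=1301, test_size=0.4
)

# Assumption: any classifier with predict_proba works here;
# the question uses xgb.XGBClassifier instead.
clf = LogisticRegression()
clf.fit(X_fit, y_fit)

eval_pred = clf.predict_proba(X_eval)[:, 1]  # hold-out rows taken from train
test_pred = clf.predict_proba(X_test)[:, 1]  # rows of the real test set

# Pairing hold-out predictions with id_test reproduces the original error,
# because the array and the Series have different lengths.
try:
    pd.DataFrame({"ID": id_test, "TARGET": eval_pred})
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Predictions made on the real test set line up with id_test.
submission = pd.DataFrame({"ID": id_test, "TARGET": test_pred})
```

With the names kept separate, X_eval and y_eval can still be passed as the eval_set for early stopping, while the submission is built only from predictions on the untouched X_test.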
