Array like input for Sklearn LogisticRegressionCV
Question
Originally, I read the data from a .csv file, but here I build the dataframe from lists so the problem can be reproduced. The aim is to train a logistic regression model with cross-validation using LogisticRegressionCV.
indeps = ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F']
dep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
data = [indeps, dep]
cols = ['state', 'cat_bins']
data_dict = dict((x[0], x[1]) for x in zip(cols, data))
df = pd.DataFrame.from_dict(data_dict)
df.tail()
cat_bins state
45 0.0 F
46 0.0 M
47 0.0 M
48 0.0 F
49 0.0 F
'''Use Pandas' to encode independent variables. Notice that
we are returning a sparse dataframe '''
def heat_it2(dataframe, lst_of_columns):
    dataframe_hot = pd.get_dummies(dataframe,
                                   prefix=lst_of_columns,
                                   columns=lst_of_columns, sparse=True)
    return dataframe_hot
train_set_hot = heat_it2(df, ['state'])
train_set_hot.head(2)
cat_bins state_F state_M
0 1.0 0 1
1 1.0 1 0
'''Use the dataframe to set up the prospective inputs to the model as numpy arrays'''
indeps_hot = ['state_F', 'state_M']
X = train_set_hot[indeps_hot].values
y = train_set_hot['cat_bins'].values
print 'X-type:', X.shape, type(X)
print 'y-type:', y.shape, type(y)
print 'X has shape, is an array and has length:\n', hasattr(X, 'shape'), hasattr(X, '__array__'), hasattr(X, '__len__')
print 'yhas shape, is an array and has length:\n', hasattr(y, 'shape'), hasattr(y, '__array__'), hasattr(y, '__len__')
print 'X does have attribute fit:\n',hasattr(X, 'fit')
print 'y does have attribute fit:\n',hasattr(y, 'fit')
X-type: (50, 2) <type 'numpy.ndarray'>
y-type: (50,) <type 'numpy.ndarray'>
X has shape, is an array and has length:
True True True
yhas shape, is an array and has length:
True True True
X does have attribute fit:
False
y does have attribute fit:
False
So, the inputs to the regressor seem to have the necessary properties for the .fit method. They are numpy arrays with the right shape. X is an array with the dimensions [n_samples, n_features], and y is a vector with shape [n_samples,]. Here is the documentation:
fit(X, y, sample_weight=None)[source]
Fit the model according to the given training data.
Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape (n_samples,)
Target vector relative to X.
....
Now we try to fit the regressor:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

logmodel = LogisticRegressionCV(Cs=1, dual=False, scoring=accuracy_score, penalty='l2')
logmodel.fit(X, y)
...
TypeError: Expected sequence or array-like, got estimator LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
The source of the error message seems to be in scikit-learn's validation.py module.
The only section of the code that raises this error message is the following function-snippet:
def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # Don't get num_samples from an ensembles length!
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
etc.
Question: Since the parameters with which we are fitting the model (X and y) do not have the attribute 'fit', why is this error message raised?
Using Python 2.7 on Canopy 1.7.4.3348 (64 bit) with scikit-learn 18.01-3 and pandas 0.19.2-2.
Thanks for your help :)
Recommended answer
The problem seems to be in the scoring argument. You have passed accuracy_score. The signature of accuracy_score is accuracy_score(y_true, y_pred[, ...]). But in the module logistic.py, the scoring callable is invoked with a different set of arguments:
if isinstance(scoring, six.string_types):
    scoring = SCORERS[scoring]
for w in coefs:
    # Other code
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))
Since you have passed the function accuracy_score rather than a string, the first line above does not apply. The call scores.append(scoring(log_reg, X_test, y_test)) is then used to score the estimator. But as I said above, the arguments here do not match the required arguments of accuracy_score: the estimator log_reg ends up where y_true is expected, so the input validation sees an object with a 'fit' attribute. Hence the error.
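The mismatch described above can be reproduced without scikit-learn. The following is a minimal sketch, not the library's code: ToyEstimator, num_samples, toy_accuracy and toy_make_scorer are hypothetical stand-ins that only mimic the two calling conventions, a metric taking (y_true, y_pred) and a scorer taking (estimator, X, y).

```python
class ToyEstimator(object):
    """Hypothetical stand-in for a fitted classifier (not scikit-learn's)."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        # Always predicts class 1, which is enough for the illustration.
        return [1 for _ in X]


def num_samples(x):
    # Mimics the guard quoted in the question: anything with a `fit`
    # attribute is treated as an estimator and rejected.
    if hasattr(x, 'fit'):
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    return len(x)


def toy_accuracy(y_true, y_pred, normalize=True):
    # Metric convention: (y_true, y_pred). Inputs are validated first.
    if num_samples(y_true) != num_samples(y_pred):
        raise ValueError('Inconsistent numbers of samples')
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / float(len(y_true)) if normalize else hits


def toy_make_scorer(metric):
    # Scorer convention: (estimator, X, y). The wrapper calls predict
    # and forwards the result to the metric in the right order.
    def scorer(estimator, X, y):
        return metric(y, estimator.predict(X))
    return scorer


est = ToyEstimator()
X_test = [[0, 1], [1, 0], [0, 1], [1, 0]]
y_test = [1, 1, 0, 1]

# Calling the raw metric the way the cross-validation loop calls a scorer:
# the estimator lands in the y_true slot and the guard raises.
try:
    toy_accuracy(est, X_test, y_test)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Wrapping the metric first gives a callable with the expected signature.
score = toy_make_scorer(toy_accuracy)(est, X_test, y_test)
print(error_message)
print(score)
```

The point of the sketch is only that X and y are never the objects being rejected; the estimator itself is passed into the slot where the metric expects the true labels.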
Workaround: Use make_scorer(accuracy_score) in LogisticRegressionCV for scoring, or simply pass the string 'accuracy'.
from sklearn.metrics import accuracy_score, make_scorer

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')
OR
logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring='accuracy',
                                penalty='l2')
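For a self-contained check that both forms of the workaround behave the same, here is a small sketch. It assumes a reasonably recent scikit-learn and uses synthetic one-hot data shaped like the question's (50 rows, two dummy columns) rather than the original lists; the remaining constructor arguments are left at their defaults, and cv=3 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, make_scorer

# Synthetic stand-in for the question's data: one binary category encoded
# as two dummy columns, and a target with 33 ones followed by 17 zeros.
rng = np.random.RandomState(0)
is_f = rng.randint(0, 2, size=50)
X = np.column_stack([is_f, 1 - is_f]).astype(float)
y = np.array([1.0] * 33 + [0.0] * 17)

# Option 1: name the scorer by its string alias.
model_str = LogisticRegressionCV(scoring='accuracy', cv=3)
model_str.fit(X, y)

# Option 2: wrap the metric so it has the (estimator, X, y) signature.
model_fn = LogisticRegressionCV(scoring=make_scorer(accuracy_score), cv=3)
model_fn.fit(X, y)

pred_str = model_str.predict(X)
pred_fn = model_fn.predict(X)
print(pred_str.shape, (pred_str == pred_fn).all())
```

Both models are fitted with the same scoring rule, so they should select the same regularization strength and produce the same predictions.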
Note: This may be a bug in the logistic.py module, or a gap in the documentation of LogisticRegressionCV, which should have clarified the signature of the scoring function.
You may submit an issue on GitHub and see how it goes. Done.