Array-like input for Sklearn LogisticRegressionCV


Question

Originally I read the data from a .csv file, but here I build the dataframe from lists so that the problem can be reproduced. The aim is to train a logistic regression model with cross-validation using LogisticRegressionCV.

import pandas as pd

indeps = ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'M', 'F', 'M', 'M', 'F', 'F']
dep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

data = [indeps, dep] 
cols = ['state', 'cat_bins']

data_dict = dict(zip(cols, data))

df = pd.DataFrame.from_dict(data_dict)
df.tail()

    cat_bins    state
45  0.0           F
46  0.0           M
47  0.0           M
48  0.0           F
49  0.0           F


'''Use pandas to one-hot encode the independent variables. Note that
 we are returning a sparse dataframe.'''

def heat_it2(dataframe, lst_of_columns):
    dataframe_hot = pd.get_dummies(dataframe,
                                   prefix = lst_of_columns,
                                   columns = lst_of_columns, sparse=True,)
    return dataframe_hot

train_set_hot = heat_it2(df, ['state'])
train_set_hot.head(2)

    cat_bins    state_F     state_M
0     1.0         0            1
1     1.0         1            0

'''Use the dataframe to set up the prospective inputs to the model as numpy arrays'''

indeps_hot = ['state_F', 'state_M']

X = train_set_hot[indeps_hot].values
y = train_set_hot['cat_bins'].values

print 'X-type:', X.shape, type(X)
print 'y-type:', y.shape, type(y)
print 'X has shape, is an array and has length:\n', hasattr(X, 'shape'), hasattr(X, '__array__'), hasattr(X, '__len__')
print 'y has shape, is an array and has length:\n', hasattr(y, 'shape'), hasattr(y, '__array__'), hasattr(y, '__len__')
print 'X does have attribute fit:\n',hasattr(X, 'fit')
print 'y does have attribute fit:\n',hasattr(y, 'fit')

X-type: (50, 2) <type 'numpy.ndarray'>
y-type: (50,) <type 'numpy.ndarray'>
X has shape, is an array and has length:
True True True
y has shape, is an array and has length:
True True True
X does have attribute fit:
False
y does have attribute fit:
False

So the inputs to the model seem to have the properties the .fit method requires. They are numpy arrays with the right shapes: X is an array with dimensions [n_samples, n_features], and y is a vector with shape [n_samples,]. Here is the documentation:

fit(X, y, sample_weight=None)

    Fit the model according to the given training data.

    Parameters:

    X : {array-like, sparse matrix}, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and n_features is the number of features.

    y : array-like, shape (n_samples,)
        Target vector relative to X.

    ...

Now we try to fit the model:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

logmodel = LogisticRegressionCV(Cs=1, dual=False, scoring=accuracy_score, penalty='l2')
logmodel.fit(X, y)

...

    TypeError: Expected sequence or array-like, got estimator LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
    penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
    verbose=0, warm_start=False)

The error message seems to come from scikit-learn's validation.py module.

The only part of that code that raises this error message is the following function snippet:

def _num_samples(x):
    """Return number of samples in array-like x."""
    if hasattr(x, 'fit'):
        # Don't get num_samples from an ensembles length!
        raise TypeError('Expected sequence or array-like, got '
                        'estimator %s' % x)
    # ... (rest of the function omitted)

Question: Since the arguments we pass when fitting the model (X and y) do not have a 'fit' attribute, why is this error raised?

Using Python 2.7 on Canopy 1.7.4.3348 (64-bit) with scikit-learn 18.01-3 and pandas 0.19.2-2.

Thanks for your help :)

Recommended answer

The problem seems to be in the scoring argument. You passed accuracy_score, whose signature is accuracy_score(y_true, y_pred[, ...]). But the module logistic.py contains:

if isinstance(scoring, six.string_types):
    scoring = SCORERS[scoring]
for w in coefs:
    # Other code
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))

Since you passed the function accuracy_score rather than a string, the first line above does not apply and scoring is left as it is. It is then called as scoring(log_reg, X_test, y_test) to score the estimator. As noted above, these arguments do not match what accuracy_score expects: the estimator log_reg lands in the y_true position, which is why the error message reports that it got an estimator.
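To make the mismatch concrete, here is a minimal pure-Python sketch of the two calling conventions. It does not use scikit-learn's actual code: num_samples, accuracy, MajorityClassifier and as_scorer are simplified stand-ins invented for illustration, and the sketch targets Python 3 rather than the Python 2.7 used in the question.

```python
def num_samples(x):
    """Simplified stand-in for the _num_samples check quoted in the question."""
    if hasattr(x, 'fit'):
        raise TypeError('Expected sequence or array-like, got estimator %s' % x)
    return len(x)


def accuracy(y_true, y_pred, normalize=True):
    """Metric convention: compares true labels with predicted labels."""
    if num_samples(y_true) != num_samples(y_pred):
        raise ValueError('inconsistent numbers of samples')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) if normalize else correct


class MajorityClassifier:
    """Hypothetical toy estimator that always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def as_scorer(metric):
    """Adapt metric(y_true, y_pred) to the scorer convention (estimator, X, y)."""
    def scorer(estimator, X, y):
        return metric(y, estimator.predict(X))
    return scorer


X = [[0, 1], [1, 0], [0, 1], [1, 0]]
y = [1, 1, 1, 0]
clf = MajorityClassifier().fit(X, y)

# Calling the raw metric the way the cross-validation loop calls `scoring`
# puts the estimator in the y_true slot, so the sample-count check rejects it.
try:
    accuracy(clf, X, y)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Wrapping the metric first gives a callable with the expected signature.
score = as_scorer(accuracy)(clf, X, y)

print(error_message)
print(score)
```

The wrapper plays the role that make_scorer plays in the workaround: it calls predict on the estimator first and only then hands the labels to the metric.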

Workaround: pass make_scorer(accuracy_score) as the scoring argument of LogisticRegressionCV, or simply pass the string 'accuracy'.

from sklearn.metrics import accuracy_score, make_scorer

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring=make_scorer(accuracy_score),
                                penalty='l2')

                         OR

logmodel = LogisticRegressionCV(Cs=1, dual=False,
                                scoring='accuracy',
                                penalty='l2')
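For reference, a self-contained sketch that fits the model with both variants might look like the following. It assumes Python 3 and a recent scikit-learn rather than the versions in the question. The data is a synthetic stand-in for the one-hot encoded frame, and dual and penalty are left at their defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, make_scorer

# Synthetic stand-in for the question's data (an assumption, not the real frame):
# the columns mimic [state_F, state_M], the labels mimic the binary target.
rng = np.random.RandomState(0)
is_f = rng.randint(0, 2, size=50)
X = np.column_stack([is_f, 1 - is_f])
y = np.array([1] * 33 + [0] * 17)

fitted = {}
for name, scoring in [('make_scorer', make_scorer(accuracy_score)),
                      ('string', 'accuracy')]:
    # Both forms give LogisticRegressionCV a scorer it can call as
    # scorer(estimator, X, y) during cross-validation.
    model = LogisticRegressionCV(Cs=[1.0], scoring=scoring)
    model.fit(X, y)
    fitted[name] = model

print({name: m.predict(X).shape for name, m in fitted.items()})
```

Both variants should fit without the TypeError from the question, since neither passes the raw metric function as scoring.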

Note:

This may be a bug in the logistic.py module, or a gap in the LogisticRegressionCV documentation, which should have clarified the expected signature of the scoring function.

You can submit an issue on GitHub and see how it goes. Update: done.
