Why does scikit-learn demand different data shapes for different regressors?

Question

I always find myself reshaping my data when I work with sklearn; it's irritating and makes my code ugly. Why can't the library work with a variety of data shapes and interpret them appropriately? For example, to use a linear regressor I need to do:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.random.rand(10).reshape(-1, 1)  # 10 samples, 1 feature
y = np.random.rand(10).reshape(-1, 1)  # target reshaped to a (10, 1) column
regr = LinearRegression()
regr.fit(x, y)

But if I want to use a support vector regressor, then I don't reshape the dependent variable:

import numpy as np
from sklearn.svm import SVR

x = np.random.rand(10).reshape(-1, 1)  # 10 samples, 1 feature
y = np.random.rand(10)                 # target left one-dimensional, shape (10,)
regr = SVR()
regr.fit(x, y)

I presume there is some reason why the library is designed this way; can anyone enlighten me?

Recommended answer

When you do y = np.random.rand(10), y is a one-dimensional array of shape (10,). It doesn't matter whether you think of it as a row vector or a column vector; it's just a vector with only one dimension. Take a look at this answer and this one too (https://stackoverflow.com/questions/16995071/numpy-array-that-is-n-1-and-n) to understand the philosophy behind it.
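To make the distinction concrete, here is a small sketch that compares the one-dimensional array with its two-dimensional column and row forms. The variable names col and row are only illustrative.

```python
import numpy as np

y = np.random.rand(10)   # one-dimensional: a single axis of length 10
col = y.reshape(-1, 1)   # two-dimensional column: 10 rows, 1 column
row = y.reshape(1, -1)   # two-dimensional row: 1 row, 10 columns

print(y.shape, y.ndim)
print(col.shape, col.ndim)
print(row.shape, row.ndim)

# Transposing a 1-D array changes nothing, because there is only one axis.
print(y.T.shape == y.shape)
```

The 1-D array is neither a row nor a column until you reshape it, which is why numpy cannot decide the orientation for you.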

It's part of the "numpy philosophy", and sklearn depends on numpy.

As for your comment:

Why doesn't sklearn automatically understand that if I pass it something of shape (n,), then n_samples=n and n_features=1?

From the X data alone, sklearn cannot tell whether an array of shape (n,) means n_samples=n and n_features=1 or the other way around (n_samples=1 and n_features=n). It could be inferred if y is passed, since y makes n_samples clear.
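The ambiguity can be sketched as follows. The same six values can be read either way, so sklearn asks you to state the layout explicitly and rejects a 1-D X. The names as_samples, as_features and rejected_1d are illustrative, and the exact wording of the error message may vary between sklearn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

values = np.arange(6, dtype=float)   # shape (6,): layout not specified

as_samples = values.reshape(-1, 1)   # 6 samples, 1 feature
as_features = values.reshape(1, -1)  # 1 sample, 6 features
print(as_samples.shape, as_features.shape)

y = 2.0 * values + 1.0               # one target per sample

# With the layout made explicit, fitting works and one coefficient is learned.
model = LinearRegression().fit(as_samples, y)
print(model.coef_.shape)

# Passing the 1-D array as X is rejected instead of being guessed at.
try:
    LinearRegression().fit(values, y)
    rejected_1d = False
except ValueError as err:
    rejected_1d = True
    print("ValueError:", str(err).splitlines()[0])
```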

But that would mean changing all the code that relies on these semantics, and that may break many things, because sklearn depends heavily on numpy operations.
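As one illustration of code relying on these semantics (my own example, not one given in the original answer), the shape of y passed to LinearRegression carries through to the shape of the predictions. A 1-D y is treated as a single target, while a 2-D y is treated as a set of target columns. The names below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((10, 1))        # 10 samples, 1 feature
y_1d = rng.random(10)          # shape (10,): a single target
y_2d = y_1d.reshape(-1, 1)     # shape (10, 1): one target column

pred_1d = LinearRegression().fit(X, y_1d).predict(X)
pred_2d = LinearRegression().fit(X, y_2d).predict(X)

print(pred_1d.shape)  # matches y_1d: (10,)
print(pred_2d.shape)  # matches y_2d: (10, 1)
```

Silently reinterpreting shapes would change outputs like these for existing users, which is one reason the library keeps the convention strict.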

You may also want to check the following links where similar issues are discussed.
