Linear Regression on Pandas DataFrame using Sklearn (IndexError: tuple index out of range)

Question

I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:

data = pd.read_csv('xxxx.csv')

After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered

X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)

which resulted in the following error:

IndexError: tuple index out of range

What's wrong here? Also, I'd like to know

  1. how to visualize the result
  2. how to make predictions based on the result?

I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.

Can you please help? Thank you very much for your time.

PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.

Recommended answer

Let's assume your csv looks something like:

c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...

I generated the data like this:

import numpy as np
import pandas as pd  # needed for pd.read_csv further down
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt

length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))

This data is saved to test.csv (just so you know where it came from, obviously you'll use your own).
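
The saving step itself is not shown above. A minimal sketch of how such a file could be written, assuming pandas is used for the export and that the columns are named c1 and c2 as in the sample (the seeded generator and the 1-D arrays are choices made for this illustration only):

```python
import numpy as np
import pandas as pd

length = 10
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
c1 = np.arange(length, dtype=float)
c2 = c1 + rng.random(length) * 10  # same idea as above: x plus noise in [0, 10)

# Write two columns named c1 and c2, leaving out the row index.
frame = pd.DataFrame({"c1": c1, "c2": c2})
frame.to_csv("test.csv", index=False)

# Reading the file back gives a two-column DataFrame again.
data = pd.read_csv("test.csv")
print(data.shape)
```

Passing index=False keeps the row index out of the file, so reading it back yields exactly the two data columns.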

data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print(x)  # a 1-D array holding 0.0 through 9.0, shape (10,)

You need to take a look at the shape of the data you are feeding into .fit().

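Here x.shape = (10,), but fit() expects X to be 2-D with shape (n_samples, n_features), which is (10, 1) here; see the sklearn documentation. y may stay 1-D, but reshaping it to (10, 1) also works and keeps the two arrays consistent. So we reshape: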

x = x.reshape(length, 1)
y = y.reshape(length, 1)
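
If the number of rows is not known in advance, the same reshape can be written without length. A small sketch, assuming a stand-in DataFrame with made-up values in place of the CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV contents.
data = pd.DataFrame({"c1": np.arange(5, dtype=float),
                     "c2": [1.0, 2.5, 2.0, 4.5, 5.0]})

x_1d = data["c1"].values    # single brackets give a 1-D array
x_2d = x_1d.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
x_df = data[["c1"]].values  # double brackets select a one-column DataFrame, already 2-D

print(x_1d.shape, x_2d.shape, x_df.shape)  # (5,) (5, 1) (5, 1)
```

Either of the 2-D forms can be passed to fit() as the feature matrix.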

Now we create the regression object and then call fit():

regr = linear_model.LinearRegression()
regr.fit(x, y)

# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y,  color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
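
The question also asked how to make predictions from the result. A minimal sketch, assuming made-up data that lies exactly on y = 2x + 1 so the fitted numbers are easy to check (x_new and its values are purely illustrative):

```python
import numpy as np
from sklearn import linear_model

# Made-up data lying exactly on y = 2x + 1.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

regr = linear_model.LinearRegression()
regr.fit(x, y)

# New inputs must also be 2-D: one row per sample, one column per feature.
x_new = np.array([[10.0], [12.5]])
y_new = regr.predict(x_new)

print(regr.coef_, regr.intercept_)  # slope and intercept of the fitted line
print(y_new)                        # approximately [21. 26.]
```

The fitted slope and intercept are available as regr.coef_ and regr.intercept_, and predict() accepts any 2-D array with the same number of columns as the training data.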

See sklearn linear regression example.
