R^2 score is not well-defined with less than two samples. Python Sklearn

Question

I am using a Linear Regression model to predict some values. I have already figured out the basic part, and now it looks like this:

import time as ti
import pandas as pd 
import numpy as np
from matplotlib import pyplot as plt 
import csv
from sklearn.datasets import load_boston
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from scipy.interpolate import * 
import datetime

data = pd.read_csv(r"C:\Users\simon\Desktop\Datenbank\visualisierung\includes\csv.csv")         
x = np.array(data["day"])   
y = np.array(data["balance"])

reg = linear_model.LinearRegression()
X_train, X_test, y_train, y_test, i_train, i_test = train_test_split(x, y, data.index, test_size=0.2, random_state=4)

X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

i_train = i_train.values.reshape(-1, 1)
i_test = i_test.values.reshape(-1, 1)


reg.fit(i_train, y_train)

print(reg.score(i_test, y_test))

252128,6/6/19
252899,7/6/19
253670,8/6/19
254441,9/6/19

I have 27 rows in total.

For some reason it doesn't work. I get this warning:

UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.

The dtypes and shapes are:

X_train, X_test = object #dtype
X_train = (21,)  #shape
X_test = (6,)    #shape

y_train, y_test = int64 #dtype
y_train, y_test = (1, 21) #shape

i_train, i_test = int64 #dtype
i_train, i_test = (1, 21) #shape

X_train, X_test, y_train, y_test, i_train, i_test are all of type:

<class 'numpy.ndarray'>

I could imagine that's because I don't have enough examples.

Why does this happen, and how can I prevent it?

Recommended answer

scikit-learn estimators expect the features as a 2-D array of shape (n_samples, n_features). Therefore, if your dataset consists of only one feature, you need to reshape your training and test sets using:

X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

and the rest of your code should work properly.
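The (1, 21) shapes reported in the question hint at where the warning may come from. The sketch below assumes the arrays were at some point reshaped with reshape(1, -1). This is a guess, since the posted code uses reshape(-1, 1). The array x is a hypothetical stand-in for one feature column with six test rows.

```python
import numpy as np

# Hypothetical stand-in for a single feature column with 6 test rows.
x = np.arange(6)

as_rows = x.reshape(-1, 1)     # 6 samples, 1 feature
as_one_row = x.reshape(1, -1)  # 1 sample, 6 features

print(as_rows.shape)     # (6, 1)
print(as_one_row.shape)  # (1, 6)

# scikit-learn reads the first axis as the number of samples.
n_samples_ok = as_rows.shape[0]
n_samples_bad = as_one_row.shape[0]
print(n_samples_ok, n_samples_bad)  # 6 1
```

With the second layout the scorer sees a single sample, which is the situation the warning describes.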

Given the OP's clarification, the dataset appears to be a time series. Linear regression will not model this data properly, but as a toy example to have fun with, you can convert the dates to POSIX time, split the data, and test different algorithms.

Assuming your dataset is:

    balance day
0   252128  6/6/19
1   252899  7/6/19
2   253670  8/6/19
3   254441  9/6/19
4   255944  10/6/19
5   256041  11/6/19
6   256670  12/6/19
7   257441  13/6/19
8   258128  14/6/19
9   258899  15/6/19
10  259670  16/6/19
11  260241  17/6/19
12  260444  18/6/19
13  260341  19/6/19
14  260670  20/6/19
15  261441  21/6/19

you can modify the code this way:

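import pandas as pd
from sklearn import linear_model

data = pd.read_csv('csv.csv')

# the dates are day/month/year (e.g. 13/6/19), so parse them day-first
X = pd.to_datetime(data['day'], dayfirst=True)
# convert to POSIX seconds by integer-dividing the epoch nanoseconds by 10**9
# (assumes the column has nanosecond resolution, i.e. datetime64[ns])
X = X.astype("int64").values.reshape(-1, 1) // 10**9
y = data['balance']

# split chronologically: first 12 rows for training, last 4 for testing
X_train = X[:12]
y_train = y[:12]
X_test = X[-4:]
y_test = y[-4:]

# fit an ordinary least-squares linear model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# R^2 on the held-out rows
print(reg.score(X_test, y_test))

# predicted balances for the test dates
print(reg.predict(X_test))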

What do you get? A very poor solution.
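To see how the score measures "poor", and why it is undefined for a single sample, here is a minimal sketch that computes R^2 by hand as 1 - SS_res / SS_tot. The helper name r2_by_hand and the numbers are illustrative assumptions, not results from the dataset above. scikit-learn's own handling of edge cases may differ from this simplified version.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed without scikit-learn."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        # A single sample (or constant targets) leaves no variance to explain.
        return float("nan")
    return 1 - ss_res / ss_tot

# Illustrative numbers only, not taken from the dataset above.
y_true = [10.0, 12.0, 14.0]
print(r2_by_hand(y_true, [10.0, 12.0, 14.0]))  # 1.0   (perfect fit)
print(r2_by_hand(y_true, [20.0, 22.0, 24.0]))  # -36.5 (worse than predicting the mean)
print(r2_by_hand([10.0], [11.0]))              # nan   (one sample: undefined)
```

A negative score means the model does worse than always predicting the mean of the test targets, which can happen when a straight line is extrapolated over a time series.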
