sklearn.cross_validation.StratifiedShuffleSplit - error: "indices are out-of-bounds"

Views: 132 Published: 2020/5/24 1:41:19 Tags: python pandas scikit-learn


Question

I was trying to split a sample dataset using Scikit-learn's Stratified Shuffle Split. I followed the example shown in the Scikit-learn documentation:

import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)

# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)

for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]

# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()

However, upon running this script, I get the following error:

IndexError: indices are out-of-bounds

Could someone please point out what I am doing wrong here? Thanks!

Recommended answer

You're running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row positions. But data is a Pandas DataFrame, and when you index it with a single key, as in data[train_index], Pandas expects train_index to contain column labels rather than row positions. You can either convert the DataFrame to a NumPy array, using .values:

data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]

or use the Pandas .iloc accessor:

for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]

I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
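
For readers on a scikit-learn release that ships sklearn.model_selection instead of sklearn.cross_validation, the same fix can be sketched as below. This is a minimal illustration, not the original poster's script: the toy frame and its column names (acidity, sugar, alcohol) are hypothetical stand-ins for the wine data, and random_state is added only to make the run repeatable. It also uses .iloc on target, which is an extra precaution in case the Series index is not the default 0..n-1 range.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for the wine data: 100 rows, 3 feature columns,
# and a 'quality' target with two balanced classes.
rng = np.random.RandomState(0)
wine = pd.DataFrame(rng.rand(100, 3), columns=["acidity", "sugar", "alcohol"])
wine["quality"] = np.repeat([5, 6], 50)

target = wine["quality"]
data = wine.drop("quality", axis=1)

# In sklearn.model_selection the splitter takes n_splits (not n_iter), and
# the labels are passed to split() rather than to the constructor.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

for train_index, test_index in sss.split(data, target):
    # data[train_index] would look for *columns* labelled 0, 1, 2, ... and
    # fail; .iloc selects rows by integer position instead.
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]

# Reproduce the failure mode from the question on the last split.
try:
    data[train_index]
    plain_indexing_failed = False
except (KeyError, IndexError):
    plain_indexing_failed = True

print(plain_indexing_failed)
print(ytrain.value_counts().to_dict())
print(ytest.value_counts().to_dict())
```

Depending on the Pandas version, the plain data[train_index] lookup raises either a KeyError or the IndexError quoted in the question, which is why the sketch catches both.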
