Scikit-learn train_test_split with indices

Posted: 2021/12/25 14:24:31. Tags: python, scipy, scikit-learn, classification

Question

How do I get the original indices of the data when using train_test_split()?

I have the following:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
import numpy as np
data = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples
labels = np.random.randint(2, size=10)  # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, test_size=0.2)

But this does not give the indices of the original data. One workaround is to attach the indices to the data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass the result to train_test_split, and then unpack it again. Is there a cleaner solution?

Recommended answer

Scikit-learn plays really well with Pandas, so I suggest you use it. Here's an example:

In [1]: 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels

In [2]: # Giving columns in X a name
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=0)

In [4]: X_test
Out[4]:

     Column_1    Column_2
2   -1.39       -1.86
8    0.48       -0.81
4   -0.10       -1.83

In [5]: y_test
Out[5]:

2    1
8    1
4    1
dtype: int32
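
The row labels in the output above are the original positions in data, because a DataFrame keeps its index through the split. As a minimal sketch of pulling those indices out as plain arrays (the seeded RandomState and the names train_idx, test_idx and recovered are illustrative assumptions, not part of the original answer):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is reproducible (an assumption for illustration)
rng = np.random.RandomState(0)
data = rng.randn(10, 2)           # 10 training examples
labels = rng.randint(2, size=10)  # 10 labels

X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The default RangeIndex survives the split, so each index value
# is the row's position in the original array.
train_idx = X_train.index.to_numpy()
test_idx = X_test.index.to_numpy()

# Use the indices to look the test rows up in the original NumPy array.
recovered = data[test_idx]
print(test_idx)
```

This relies on X having the default integer index. With a custom index, the values returned are the labels, not positions.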

Scikit-learn functions generally accept a DataFrame or Series directly, so no conversion is needed.

Say you want to fit a LogisticRegression; here is a tidy way to retrieve the coefficients:

In [6]: 
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model = model.fit(X_train, y_train)

# Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
            Coefficient
Column_1    0.076987
Column_2    -0.352463
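
If you would rather stay with plain NumPy arrays, one alternative (not part of the answer above) is to pass an array of positions as an extra argument, since train_test_split splits any number of equal-length arrays consistently. A minimal sketch, where the seeded generator and the names indices, idx1 and idx2 are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is reproducible (an assumption for illustration)
rng = np.random.RandomState(0)
data = rng.randn(10, 2)           # 10 training examples
labels = rng.randint(2, size=10)  # 10 labels
indices = np.arange(len(data))    # original row positions 0..9

# Every array passed in is split with the same shuffle, so idx1/idx2
# hold the original positions of the train/test rows.
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2, random_state=0)

print(idx2)
```

Here data[idx2] gives back the same rows as x2, which avoids the enumerate-and-unpack workaround from the question.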
