Scikit-learn train_test_split with indices
Question
How do I get the original indices of the data when using train_test_split()?
I have the following:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
import numpy as np
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, test_size=0.2)
But this does not give the indices of the original data.
One workaround is to add the indices to data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass the result to train_test_split, and then unpack it again.
Are there any cleaner solutions?
Recommended answer
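scikit-learn works really well with pandas, so I suggest you use it. train_test_split keeps the index of a DataFrame or Series, so every row in a split still carries its original index. Here's an example: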
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels
In [2]: # Giving columns in X a name
X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)
In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
In [4]: X_test
Out[4]:
   Column_1  Column_2
2     -1.39     -1.86
8      0.48     -0.81
4     -0.10     -1.83
In [5]: y_test
Out[5]:
2 1
8 1
4 1
dtype: int32
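If you need the indices as a plain array rather than just seeing them in the printed output, something along these lines should work. This is a minimal sketch, not part of the original answer: the seeded RandomState and the names rng, train_idx and test_idx are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is reproducible (an assumption, not in the original).
rng = np.random.RandomState(0)
data = rng.randn(10, 2)            # 10 training examples
labels = rng.randint(2, size=10)   # 10 labels

X = pd.DataFrame(data, columns=['Column_1', 'Column_2'])
y = pd.Series(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The index of each split holds the row labels of the original DataFrame.
train_idx = X_train.index.to_numpy()
test_idx = X_test.index.to_numpy()

# X was built with the default RangeIndex, so the labels equal row positions
# and can be used to look the rows up in the original NumPy array.
assert np.array_equal(data[test_idx], X_test.to_numpy())
assert np.array_equal(labels[test_idx], y_test.to_numpy())
```

Note that this only maps back to positions in data because X uses the default integer index; with a custom index the values in test_idx are labels, not positions.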
You can pass a DataFrame or Series directly to most scikit-learn functions and it will work.
Let's say you want to fit a LogisticRegression. Here's how you could retrieve the coefficients with the feature names attached:
In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(X_train, y_train)
# Retrieve coefficients: index is the feature name (['Column_1', 'Column_2'] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
          Coefficient
Column_1     0.076987
Column_2    -0.352463
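If you would rather stay with plain NumPy arrays, a lighter variant of the workaround from the question is to pass an index array as an extra argument, since train_test_split splits every array it receives in the same way. This is a sketch under that assumption and not part of the original answer; the names indices, idx1 and idx2 and the seeded RandomState are mine.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is reproducible (an assumption, not in the original).
rng = np.random.RandomState(0)
data = rng.randn(10, 2)            # 10 training examples
labels = rng.randint(2, size=10)   # 10 labels
indices = np.arange(len(data))     # original row positions 0..9

# Every array passed in is split with the same shuffle,
# so idx1/idx2 line up row for row with x1/x2 and y1/y2.
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2, random_state=0)

# The returned indices point back at the matching rows of the original arrays.
assert np.array_equal(data[idx2], x2)
assert np.array_equal(labels[idx2], y2)
```

This avoids packing the indices into data and unpacking them afterwards, at the cost of two extra return values.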