K-means using only specific dataframe columns with scikit-learn
Question
I'm using the k-means algorithm from the scikit-learn library, and the values I want to cluster are in a pandas dataframe with 3 columns: ID, value_1 and value_2.
I want to cluster the information using value_1 and value_2 only, but I also want to keep the ID associated with each row (so I can create a list of IDs in each cluster).
What's the best way of doing this? Currently the clustering uses the ID number as well, which is not what I intend.
My current code (X is the pandas dataframe):
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
(X_train, X_test) = train_test_split(X[['value_1','value_2']],test_size=0.30)
kmeans = kmeans.fit(X_train)
Recommended answer
Do the clustering using only the columns of interest (as in your example), but split the full dataframe so that the ID column stays attached to each row. Then add the array of labels, kmeans.labels_, as another column to X_train. The labels are in the same order as the rows passed to fit(). Note that kmeans.labels_ covers only those rows; to label X_test, call kmeans.predict() on the same columns of X_test.
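import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
# A toy DF
X = pd.DataFrame({'id': [1,2,3,4,5],
                  'value_1': [1,3,1,4,5],
                  'value_2': [0,0,1,5,0]})
# Same estimator settings as in the question
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
# Split ALL columns, so 'id' travels with each row
(X_train, X_test) = train_test_split(X,test_size=0.30)
# Cluster using SOME columns
kmeans = kmeans.fit(X_train[['value_1','value_2']])
# Save the labels (one per row of X_train, in row order)
X_train.loc[:,'labels'] = kmeans.labels_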
Since both X_train and X_test are slices of X, you may see a warning here:
A value is trying to be set on a copy of a slice from a DataFrame.
You can ignore it; the labels column is still added to X_train.
X_train
#    id  value_1  value_2  labels
# 4   5        5        0       0
# 0   1        1        0       0
# 3   4        4        5       1
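The question also asks for a list of IDs in each cluster. A minimal sketch of that last step follows, reusing the toy dataframe above. The names features and ids_per_cluster are illustrative and not from the original answer. The random_state passed to train_test_split and the explicit .copy() calls are assumptions added here, to make the split repeatable and to avoid the warning.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Toy data mirroring the example above
X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})
features = ['value_1', 'value_2']  # columns used for clustering (illustrative name)

# Split all columns so the id stays attached to each row
X_train, X_test = train_test_split(X, test_size=0.30, random_state=1)
X_train = X_train.copy()  # explicit copies avoid the SettingWithCopyWarning
X_test = X_test.copy()

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans.fit(X_train[features])

# labels_ covers only the rows passed to fit(); use predict() for other rows
X_train['labels'] = kmeans.labels_
X_test['labels'] = kmeans.predict(X_test[features])

# Build one list of ids per cluster label
ids_per_cluster = X_train.groupby('labels')['id'].apply(list).to_dict()
print(ids_per_cluster)
```

Which IDs end up in which cluster depends on the split and on the k-means initialisation, so the printed dictionary will vary with the chosen random_state values.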