K-means using only specific dataframe columns with scikit-learn

Problem description

I'm using the k-means algorithm from the scikit-learn library, and the values I want to cluster are in a pandas dataframe with 3 columns: ID, value_1 and value_2.

I want to cluster the information using value_1 and value_2, but I also want to keep the ID associated with each row (so I can create a list of IDs in each cluster).

What's the best way of doing this? Currently it clusters using the ID number as well, which is not what I intend.

My current code (X is the pandas dataframe):

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
(X_train, X_test) = train_test_split(X[['value_1','value_2']],test_size=0.30)
kmeans = kmeans.fit(X_train)

Recommended answer

Do the clustering using only the columns of interest (as in your example), then add the labels in kmeans.labels_ as another column to X_train. The labels are in the same order as the rows the model was fitted on. kmeans.labels_ only covers those rows, so to label X_test, call kmeans.predict on the same columns instead.

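import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# A toy DF
X = pd.DataFrame({'id': [1,2,3,4,5],
                  'value_1': [1,3,1,4,5],
                  'value_2': [0,0,1,5,0]})

# Split ALL columns so that 'id' stays attached to each row
(X_train, X_test) = train_test_split(X,test_size=0.30)
# Cluster using SOME columns
kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans = kmeans.fit(X_train[['value_1','value_2']])
# Save the labels (same order as the rows of X_train)
X_train.loc[:,'labels'] = kmeans.labels_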

Since both X_train and X_test are slices of X, you may see a warning here:

A value is trying to be set on a copy of a slice from a DataFrame.

You can ignore it in this case.
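
If you would rather not see the warning at all, one option is to take an explicit copy of the training split before adding the column. This is a minimal sketch on the same toy data. The random_state passed to train_test_split is an addition here, only to make the split repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})

# random_state is an assumption added for repeatability
X_train, X_test = train_test_split(X, test_size=0.30, random_state=0)

# An explicit copy makes X_train an independent DataFrame,
# so adding a column is no longer flagged as a write to a slice of X
X_train = X_train.copy()

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
kmeans.fit(X_train[['value_1', 'value_2']])
X_train['labels'] = kmeans.labels_
```

Either way, X_train ends up with the labels next to the id column, and X itself is left unchanged.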

X_train
#   id  value_1  value_2  labels
#4   5        5        0       0
#0   1        1        0       0
#3   4        4        5       1
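
With the labels stored next to the IDs, the list of IDs in each cluster (what the question asks for) can be built with a groupby, and rows that were not used for fitting can be labelled with kmeans.predict. This is a sketch on the same toy data. The names cols and ids_per_cluster are only illustrative, and random_state on the split is added for repeatability.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                  'value_1': [1, 3, 1, 4, 5],
                  'value_2': [0, 0, 1, 5, 0]})
cols = ['value_1', 'value_2']  # the only columns used for clustering

# random_state is an assumption added for repeatability
X_train, X_test = train_test_split(X, test_size=0.30, random_state=0)
# Explicit copies avoid the "copy of a slice" warning when adding columns
X_train, X_test = X_train.copy(), X_test.copy()

kmeans = KMeans(n_clusters=2, n_init=3, max_iter=3000, random_state=1)
X_train['labels'] = kmeans.fit(X_train[cols]).labels_
# labels_ only covers the rows passed to fit(); use predict() for the rest
X_test['labels'] = kmeans.predict(X_test[cols])

# Map each cluster label to the list of IDs assigned to it
ids_per_cluster = X_train.groupby('labels')['id'].apply(list).to_dict()
print(ids_per_cluster)
```

Which label a given cluster receives depends on the initialisation, so treat the label numbers as arbitrary names for the clusters.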
