How to apply KMeans to get the centroid using a dataframe with multiple features


Problem description

I am following this detailed KMeans tutorial: https://github.com/python-engineer/MLfromscratch/blob/master/mlfromscratch/kmeans.py which uses a dataset with 2 features.

But I have a dataframe with 5 features (columns), so instead of using the def euclidean_distance(x1, x2): function from the tutorial, I compute the Euclidean distance as below.

def euclidean_distance(df):
    n = df.shape[1]
    distance_matrix = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            distance_matrix[i,j] = np.sqrt(np.sum((df.iloc[:,i] - df.iloc[:,j])**2))
    return distance_matrix

Next, I want to implement the part of the tutorial that computes the centroid, as below:

def _closest_centroid(self, sample, centroids):
    distances = [euclidean_distance(sample, point) for point in centroids]

Since my def euclidean_distance(df): function takes only 1 argument, df, how can I best implement this in order to get the centroid?

My sample dataset, df, is as below:

col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,-2.14
0.52,0.44,0.19,0.29,30.44
1.27,1.15,1.32,0.60,-161.63
0.88,0.79,0.63,0.58,-49.52
1.39,1.15,1.32,0.41,-188.52
0.86,0.80,0.65,0.65,-45.27

[Added: plot() function]

The plot function you included gave the error TypeError: object of type 'itertools.combinations' has no len(), which I fixed by changing len(combinations) to len(list(combinations)). However, the output is not a scatter plot. Any idea what I need to fix here?

Recommended answer

Reading the data and clustering it should not throw any errors, even when you increase the number of features in the dataset. In fact, you only get an error in that part of the code when you redefine the euclidean_distance function.
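
As a minimal sketch of why the number of features does not matter: assuming the tutorial's helper is the usual two-argument Euclidean distance (its body is not quoted in the question, so the definition below is an assumption), the same function works unchanged on 5-dimensional rows. The closest_centroid helper and the choice of rows 0 and 2 as centroids are hypothetical, for illustration only.

```python
import numpy as np

# Sample rows from the question (6 samples, 5 features).
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
    [0.88, 0.79, 0.63, 0.58, -49.52],
    [1.39, 1.15, 1.32, 0.41, -188.52],
    [0.86, 0.80, 0.65, 0.65, -45.27],
])

def euclidean_distance(x1, x2):
    # Assumed form of the tutorial helper: works for vectors of any length.
    return np.sqrt(np.sum((x1 - x2) ** 2))

def closest_centroid(sample, centroids):
    # Hypothetical helper: index of the centroid nearest to one sample.
    distances = [euclidean_distance(sample, point) for point in centroids]
    return int(np.argmin(distances))

# Illustrative centroids only: rows 0 and 2 of the data.
centroids = [X[0], X[2]]
labels = [closest_centroid(sample, centroids) for sample in X]
print(labels)
```

If the data lives in a pandas dataframe, df.to_numpy() should give an equivalent array, so each row (not each column) is passed to the distance function as one sample.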

This answer addresses the actual error that you are getting from the plotting function.

def plot(self):
    fig, ax = plt.subplots(figsize=(12, 8))

    for i, index in enumerate(self.clusters):
        point = self.X[index].T
        ax.scatter(*point)

This takes all the points in a given cluster and tries to make a scatter plot.

The asterisk in ax.scatter(*point) means that point is unpacked into separate positional arguments.

The implicit assumption here (and this is why the problem might be hard to spot) is that point is 2-dimensional. Its two parts then get interpreted as the x and y values to be plotted.

But since you have 5 features, point is 5-dimensional.

Looking at the documentation for ax.scatter:

matplotlib.axes.Axes.scatter
Axes.scatter(self, x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None,
verts=<deprecated parameter>, edgecolors=None, *, plotnonfinite=False,
data=None, **kwargs)

So the first few arguments that ax.scatter takes (other than self) are:

x 
y
s (i.e. the markersize)
c (i.e. the color)
marker (i.e. the markerstyle)

The first four (x, y, s and c) accept floats, but your dataset is 5-dimensional, so the fifth feature gets interpreted as marker, which expects a MarkerStyle. Since it receives floats instead, it throws the error.
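
The argument binding can be seen without plotting anything. The sketch below uses a hypothetical stand-in, fake_scatter, that only mirrors the first five parameters of Axes.scatter and reports which value landed in which one; it is not matplotlib code.

```python
import numpy as np

def fake_scatter(x, y, s=None, c=None, marker=None):
    # Hypothetical stand-in mirroring the leading parameters of Axes.scatter.
    return {"x": x, "y": y, "s": s, "c": c, "marker": marker}

# Two samples with 5 features each, taken from the question's data.
cluster = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
])

point = cluster.T             # shape (5, 2): one row per feature
bound = fake_scatter(*point)  # unpacks into 5 positional arguments

# The fifth feature ends up in the `marker` slot.
print(bound["marker"])
```

With only two features, the same call would fill just x and y, which is why the tutorial's plot works on its own 2-feature dataset.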

To avoid this, either look at only 2 or 3 dimensions at a time, or use dimensionality reduction (e.g. principal component analysis) to project the data to a lower-dimensional space.

For the first option, you can redefine the plot method within the KMeans class:

def plot(self):
    import itertools

    # All pairs of feature indices. Materialise the iterator as a list so it
    # has a len() and is not exhausted before the loop below.
    n_features = self.X.shape[1]
    combinations = list(itertools.combinations(range(n_features), 2))

    # Initialise one subplot per feature pair; squeeze=False always returns
    # a 2-D array of axes, even when there is only one pair.
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1, squeeze=False)

    for (x, y), ax in zip(combinations, axes.ravel()):  # loop through combinations and subplots

        for i, index in enumerate(self.clusters):
            point = self.X[index].T

            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)

        for point in self.centroids:

            # only get the coordinates for this combination:
            px, py = point[x], point[y]

            ax.scatter(px, py, marker="x", color='black', linewidth=2)

        ax.set_title('feature {} vs feature {}'.format(x, y))
    plt.show()
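
For the second option, here is a minimal sketch of a PCA projection using only NumPy (centering followed by a singular value decomposition). The pca_project helper is hypothetical and not part of the tutorial; a library implementation such as scikit-learn's PCA would normally be used instead. The features in the sample data have very different scales, so standardizing them first may be advisable; that step is left out here.

```python
import numpy as np

def pca_project(X, n_components=2):
    # Hypothetical helper: project rows of X onto the top principal components.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Sample rows from the question (6 samples, 5 features).
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
    [0.88, 0.79, 0.63, 0.58, -49.52],
    [1.39, 1.15, 1.32, 0.41, -188.52],
    [0.86, 0.80, 0.65, 0.65, -45.27],
])

X_2d = pca_project(X)  # shape (6, 2), suitable for a 2-D scatter plot
print(X_2d.shape)
```

The two columns of X_2d can then be passed as the x and y arguments of ax.scatter, so the original 2-dimensional plotting code would apply to the projected data.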

