How to segment new data with an existing K-means model?

Question

I have built a segmentation model using k-means clustering.

Could anybody describe the process for assigning new data into these segments?

Currently I apply the same transformations, standardisation and outlier treatment that I used to build the model, and then calculate the Euclidean distance from each new record to each cluster mean. The record is assigned to the segment with the minimum distance.

However, I am seeing the majority of new records fall into one particular segment, and I am wondering whether I have missed something along the way.

Thanks

Recommended answer

Classifying a new observation by its Euclidean distance to the nearest cluster mean may work in some scenarios, but it ignores the shape and size of the original clusters.
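For reference, the nearest-mean assignment can be sketched roughly as follows. This is only a minimal illustration using NumPy: the function name `assign_to_segments`, the saved statistics `train_mean` and `train_std`, and all the numbers are hypothetical, and it assumes the centroids were computed on standardised features, so new records must be scaled with the statistics saved at training time rather than re-estimated ones.

```python
import numpy as np

def assign_to_segments(new_data, train_mean, train_std, centroids):
    """Assign each new record to the nearest k-means centroid.

    train_mean and train_std are the per-feature statistics saved when the
    model was built; centroids are assumed to live in that standardised space.
    """
    # Reuse the training statistics instead of re-estimating them on new data.
    scaled = (np.asarray(new_data, dtype=float) - train_mean) / train_std
    # Pairwise Euclidean distances, shape (n_records, n_clusters).
    dists = np.linalg.norm(scaled[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), dists

# Hypothetical values for illustration only.
train_mean = np.array([50.0, 10.0])
train_std = np.array([10.0, 2.0])
centroids = np.array([[-1.0, -1.0], [1.0, 1.0]])  # standardised space
new_data = np.array([[40.0, 8.0], [62.0, 12.5]])

labels, dists = assign_to_segments(new_data, train_mean, train_std, centroids)
print(labels)  # first record goes to segment 0, second to segment 1
```

If most new records still land in one segment, comparing the distribution of the scaled new data against the scaled training data is one way to check whether the preprocessing really matches.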

One way around this is to use the original clustered data to help classify each new observation, for example with k-nearest neighbours (KNN): http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
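A minimal sketch of the KNN idea, assuming the original training points and their k-means labels are still available. The helper `knn_assign` and the toy data are hypothetical: one wide cluster centred at 0 and one tight cluster centred at 10. The new point at 5.5 is closer to the mean of the tight cluster, yet most of its nearest training neighbours belong to the wide one.

```python
import numpy as np

def knn_assign(new_points, train_points, train_labels, k=3):
    """Label each new point by majority vote among its k nearest training points."""
    new_points = np.asarray(new_points, dtype=float)
    # Distances from every new point to every training point.
    dists = np.linalg.norm(new_points[:, None, :] - train_points[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_labels[nearest]  # shape (n_new, k)
    n_clusters = train_labels.max() + 1
    return np.array([np.bincount(v, minlength=n_clusters).argmax() for v in votes])

# Hypothetical clustered training data: a wide cluster 0 and a tight cluster 1.
train_points = np.array([[-4.0], [-2.0], [0.0], [2.0], [4.0], [9.9], [10.0], [10.1]])
train_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])

labels = knn_assign(np.array([[5.5], [9.0]]), train_points, train_labels, k=3)
print(labels)  # 5.5 is voted into cluster 0, 9.0 into cluster 1
```

The choice of `k` here is arbitrary; in practice it would need tuning against the size of the smallest cluster.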

Alternatively, you might consider a different clustering technique, such as a mixture of Gaussians:
http://en.wikipedia.org/wiki/Mixture_model
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html

Using this, you get not only a mean for each cluster but also a variance. For each new observation you can then compute the probability that it belongs to each cluster, and that probability takes the original cluster size/shape into account. This type of "soft" approach is also nicer to work with, because it tells you how strongly each new observation belongs to each cluster, and it lets you do things like tag as outliers any observations that are more than some number of standard deviations away from all clusters.
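The probability calculation can be sketched as below. Note that this is a simplification, not a fitted mixture model: instead of running EM, it estimates one Gaussian per cluster (weight, mean, per-feature variance) directly from existing hard cluster labels. The function names, the toy data and the 3-standard-deviation outlier threshold are all assumptions made for illustration.

```python
import numpy as np

def fit_cluster_gaussians(points, labels):
    """Estimate (weight, mean, per-feature variance) for each labelled cluster."""
    params = []
    for c in np.unique(labels):
        members = points[labels == c]
        params.append((len(members) / len(points),
                       members.mean(axis=0),
                       members.var(axis=0) + 1e-6))  # small floor avoids zero variance
    return params

def membership(x, params):
    """Return posterior cluster probabilities and standardised distances for one point."""
    log_p, z = [], []
    for weight, mean, var in params:
        sq = (x - mean) ** 2 / var
        # Log of weight times a diagonal Gaussian density.
        log_p.append(np.log(weight) - 0.5 * np.sum(np.log(2 * np.pi * var) + sq))
        z.append(np.sqrt(sq.sum()))
    log_p = np.array(log_p)
    probs = np.exp(log_p - log_p.max())  # subtract the max for numerical stability
    return probs / probs.sum(), np.array(z)

# Hypothetical data: a wide cluster around 0 and a tight cluster around 10.
points = np.array([[-4.0], [-2.0], [0.0], [2.0], [4.0], [9.9], [10.0], [10.1]])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])
params = fit_cluster_gaussians(points, labels)

results = {}
for value in (5.5, 10.0, 30.0):
    probs, z = membership(np.array([value]), params)
    # Flag as an outlier if the point is far from every cluster (threshold assumed).
    results[value] = (probs, bool(z.min() > 3.0))
    print(value, probs.round(3), results[value][1])
```

Here 5.5 is assigned almost entirely to the wide cluster even though the tight cluster's mean is nearer, and 30.0 is flagged as an outlier because it is more than 3 standard deviations from both clusters.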
