Faster Kmeans Clustering on High-dimensional Data with GPU Support


Problem Description

We've been using Kmeans for clustering our logs. A typical dataset has 10 million samples with 100k+ features.

To find the optimal k, we run multiple Kmeans in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. Currently we are using scikit-learn Kmeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB RAM.
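The k-selection loop described above can be sketched as follows. This is a toy-scale illustration, not the asker's actual pipeline: the blob data, the k range, and the thread count are assumptions for demonstration.

```python
# Run several KMeans fits in parallel and keep the k with the best
# silhouette score. Toy data; sizes are illustrative assumptions.
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)

def fit_and_score(k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return k, silhouette_score(X, labels)

# prefer="threads" avoids pickling; KMeans releases the GIL internally.
scores = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_and_score)(k) for k in range(2, 8)
)
best_k, best_score = max(scores, key=lambda t: t[1])
print(best_k, round(best_score, 3))
```

At the question's scale each fit is expensive, so the parallelism here is across candidate values of k rather than within a single fit.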

I'm currently researching a faster solution.

Things I have already tested:

  1. Kmeans + Mean Shift combination - a little better (for k=1024 → ~13h) but still slow.
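The question doesn't detail how the two algorithms were combined; one plausible pattern is to run Mean Shift on a small subsample to estimate modes, then seed KMeans with those modes. The sketch below shows that pattern at toy scale; all sizes and the subsampling step are assumptions.

```python
# Estimate cluster modes with Mean Shift on a subsample, then use them
# as the initial centers for KMeans on the full data. Toy-scale sketch.
import numpy as np
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, n_features=8, random_state=1)

# Mean Shift is too slow for the full set, so fit it on a subsample.
rng = np.random.default_rng(0)
sub = X[rng.choice(len(X), 200, replace=False)]
modes = MeanShift().fit(sub).cluster_centers_

# Seed KMeans with the discovered modes instead of random init.
km = KMeans(n_clusters=len(modes), init=modes, n_init=1).fit(X)
print(km.cluster_centers_.shape)
```

Note that scikit-learn's MeanShift requires dense input, which is part of why this route struggles at the question's 100k-feature scale.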

Kmcuda library - doesn't have support for sparse matrix representation. It would require ~3TB of RAM to represent that dataset as a dense matrix in memory.
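The ~3TB figure checks out for float32 storage, and a bit of arithmetic shows why a sparse layout is essential here. The ~1% density used below is an assumption for illustration; the real density of the log data isn't stated.

```python
# Sanity-check the dense-memory figure, and compare with a CSR sparse
# layout at an assumed ~1% density.
n_samples = 10_000_000
n_features = 100_000

dense_bytes = n_samples * n_features * 4   # float32 values
dense_tib = dense_bytes / 2**40            # ≈ 3.6 TiB, matching "~3TB"

density = 0.01                             # assumed fraction of non-zeros
nnz = int(n_samples * n_features * density)
# CSR: one float32 value + one int32 column index per non-zero,
# plus one int64 row pointer per row.
csr_bytes = nnz * (4 + 4) + (n_samples + 1) * 8
csr_gib = csr_bytes / 2**30                # ≈ 75 GiB

print(round(dense_tib, 1), round(csr_gib, 1))
```

So at 1% density the same dataset fits comfortably in the 244 GB of the EC2 instance mentioned above, which is why libraries without sparse support are non-starters.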

Tensorflow (tf.contrib.factorization.python.ops.KmeansClustering()) - only started investigating today, but either I am doing something wrong, or I don't know how to use it properly. On my first test with 20k samples and 500 features, clustering on a single GPU was slower than on a CPU with 1 thread.

Facebook FAISS - no support for sparse representation.

Next on my list is PySpark MLlib Kmeans. But would it make sense on 1 node?

Would training be faster for my use case on multiple GPUs, e.g. TensorFlow with 8 Tesla V-100s?

Is there any magical library that I haven't heard of?

Or should I just scale vertically?

Answer

Thanks to @desertnaut for his suggestion of the RAPIDS cuml library.
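cuML's KMeans follows the scikit-learn estimator API, so existing scikit-learn code ports with little change. The sketch below runs on CPU with scikit-learn; on a GPU machine, swapping the import for `from cuml.cluster import KMeans` is essentially the only change needed (the toy data sizes are assumptions).

```python
# Same estimator interface in scikit-learn and cuML: construct, fit,
# predict. Swap the import below for `from cuml.cluster import KMeans`
# to run the identical code on a GPU via RAPIDS.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=10, n_features=50, random_state=0)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print(len(set(labels)))
```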

The follow-up can be found here.

