Faster Kmeans Clustering on High-dimensional Data with GPU Support


Problem Description

We've been using Kmeans for clustering our logs. A typical dataset has 10 million samples with 100k+ features.

To find the optimal k, we run multiple Kmeans in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. Currently we are using scikit-learn Kmeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB RAM.
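The k-selection loop described above can be sketched as follows. This is a toy-scale illustration, not the asker's actual pipeline: the blob data, the k range, and the thread count are assumptions for demonstration.

```python
# Run several KMeans fits in parallel and keep the k with the best
# silhouette score. Toy data; sizes are illustrative assumptions.
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)

def fit_and_score(k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return k, silhouette_score(X, labels)

# prefer="threads" avoids pickling; KMeans releases the GIL internally.
scores = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_and_score)(k) for k in range(2, 8)
)
best_k, best_score = max(scores, key=lambda t: t[1])
print(best_k, round(best_score, 3))
```

At the question's scale each fit is expensive, so the parallelism here is across candidate values of k rather than within a single fit.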

I'm currently researching a faster solution.

Things I have already tested:

  1. Kmeans + Mean Shift combination - a little better (for k=1024 → ~13h) but still slow.
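The question doesn't detail how the two algorithms were combined; one plausible pattern is to run Mean Shift on a small subsample to estimate modes, then seed KMeans with those modes. The sketch below shows that pattern at toy scale; all sizes and the subsampling step are assumptions.

```python
# Estimate cluster modes with Mean Shift on a subsample, then use them
# as the initial centers for KMeans on the full data. Toy-scale sketch.
import numpy as np
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, n_features=8, random_state=1)

# Mean Shift is too slow for the full set, so fit it on a subsample.
rng = np.random.default_rng(0)
sub = X[rng.choice(len(X), 200, replace=False)]
modes = MeanShift().fit(sub).cluster_centers_

# Seed KMeans with the discovered modes instead of random init.
km = KMeans(n_clusters=len(modes), init=modes, n_init=1).fit(X)
print(km.cluster_centers_.shape)
```

Note that scikit-learn's MeanShift requires dense input, which is part of why this route struggles at the question's 100k-feature scale.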

Kmcuda library - doesn't have support for sparse matrix representation. It would require ~3TB of RAM to represent that dataset as a dense matrix in memory.
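The ~3TB figure checks out for float32 storage, and a bit of arithmetic shows why a sparse layout is essential here. The ~1% density used below is an assumption for illustration; the real density of the log data isn't stated.

```python
# Sanity-check the dense-memory figure, and compare with a CSR sparse
# layout at an assumed ~1% density.
n_samples = 10_000_000
n_features = 100_000

dense_bytes = n_samples * n_features * 4   # float32 values
dense_tib = dense_bytes / 2**40            # ≈ 3.6 TiB, matching "~3TB"

density = 0.01                             # assumed fraction of non-zeros
nnz = int(n_samples * n_features * density)
# CSR: one float32 value + one int32 column index per non-zero,
# plus one int64 row pointer per row.
csr_bytes = nnz * (4 + 4) + (n_samples + 1) * 8
csr_gib = csr_bytes / 2**30                # ≈ 75 GiB

print(round(dense_tib, 1), round(csr_gib, 1))
```

So at 1% density the same dataset fits comfortably in the 244 GB of the EC2 instance mentioned above, which is why libraries without sparse support are non-starters.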

Tensorflow (tf.contrib.factorization.python.ops.KmeansClustering()) - only started investigating today, but either I am doing something wrong, or I don't know how to use it properly. On my first test with 20k samples and 500 features, clustering on a single GPU was slower than on a CPU with 1 thread.

Facebook FAISS - no support for sparse representation.

Next on my list is PySpark MLlib Kmeans. But would it make sense on 1 node?

Would training be faster for my use case on multiple GPUs, e.g. TensorFlow with 8 Tesla V-100s?

Is there any magical library that I haven't heard of?

Or should I just scale vertically?

Answer

Thanks to @desertnaut for his suggestion of the RAPIDS cuml library.
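cuML's KMeans follows the scikit-learn estimator API, so existing scikit-learn code ports with little change. The sketch below runs on CPU with scikit-learn; on a GPU machine, swapping the import for `from cuml.cluster import KMeans` is essentially the only change needed (the toy data sizes are assumptions).

```python
# Same estimator interface in scikit-learn and cuML: construct, fit,
# predict. Swap the import below for `from cuml.cluster import KMeans`
# to run the identical code on a GPU via RAPIDS.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=10, n_features=50, random_state=0)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print(len(set(labels)))
```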

The follow-up can be found here.

