DBSCAN on Spark: which implementation?


Question

I would like to do some DBSCAN on Spark. I have currently found 2 implementations:

I have tested the first one with the sbt configuration given in its GitHub repo, but:

  • The functions in the jar are not the same as those in the doc or in the source on GitHub. For example, I could not find the train function in the jar.

I managed to run a test with the fit function (found in the jar), but a bad configuration of epsilon (a little too big) put the code into an infinite loop.

Code:

val model = DBSCAN.fit(eps, minPoints, values, parallelism)

Has someone managed to do something with the first library?

Has anyone tested the second one?

Answer

Please try ELKI. As this is Java, it should be easy to call from Scala.
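
As an illustration (not from the original answer), here is a minimal sketch of that interop from Scala, assuming ELKI 0.7.x on the classpath; the object and method names are placeholders, and the de.lmu.ifi.dbs.elki packages are the 0.7.x layout (later ELKI releases renamed them):

import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
import de.lmu.ifi.dbs.elki.data.{Clustering, NumberVector}
import de.lmu.ifi.dbs.elki.data.model.Model
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction

object ElkiDbscan {
  // Plain Java interop: wrap the in-memory data, initialize, run DBSCAN.
  def cluster(data: Array[Array[Double]], eps: Double, minPts: Int): Clustering[Model] = {
    val db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), null)
    db.initialize()
    // eps is squared because the squared Euclidean distance is used.
    new DBSCAN[NumberVector](SquaredEuclideanDistanceFunction.STATIC, eps * eps, minPts)
      .run(db)
  }
}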

ELKI is very well optimized, and with indexes it will scale to quite large data sets.

We tried to include one of these Spark implementations in our benchmarking study - but it ran out of memory (and it was the only implementation to run out of memory ... the k-means of Spark and Mahout were also among the slowest):

Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek.
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
In: Knowledge and Information Systems (KAIS). 2016, 1–38

Professor Neukirchen benchmarked parallel implementations of DBSCAN in this technical report:

Helmut Neukirchen
Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms

Apparently he got some of the Spark implementations working, but noted that:

The result is devastating: none of the implementations for Apache Spark is anywhere near to the HPC implementations. In particular on bigger (but still rather small) data sets, most of them fail completely and do not even deliver correct results.

And earlier:

When running any of the "Spark DBSCAN" implementations while making use of all available cores of our cluster, we experienced out-of-memory exceptions.

(also, "Spark DBSCAN" took 2406 seconds on 928 cores, ELKI took 997 seconds on 1 core for the smaller benchmark - the other Spark implementation didn't fare too well either, in particular it did not return the correct result...)

"DBSCAN on Spark" did not crash, but returned completely wrong clusters.

While "DBSCAN on Spark" finishes faster, it delivered completely wrong clustering results. Due to the hopelessly long run-times of the DBSCAN implementations for Spark already with the maximum number of cores, we did not perform measurements with a lower number of cores.

You can wrap a double[][] array as an ELKI database:

// Adapter to load data from an existing double[][] array.
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();

// Squared Euclidean is faster than Euclidean, so square eps to match.
Clustering<Model> c = new DBSCAN<NumberVector>(
  SquaredEuclideanDistanceFunction.STATIC, eps*eps, minpts).run(db);

for(Cluster<Model> clu : c.getAllClusters()) {
  // Process each cluster; clu.getIDs() holds the member DBIDs.
}

See also: Java API example (in particular, how to map DBIDs back to row indexes). For better performance, pass an index factory (such as new CoverTree.Factory(...)) as second parameter to the StaticArrayDatabase constructor.
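
A hedged sketch combining both tips, under the same ELKI 0.7.x assumption: the CoverTree.Factory arguments (expansion rate 1.3, truncation 10) are illustrative guesses rather than values from the answer, and the DBIDRange cast relies on ArrayAdapterDatabaseConnection assigning a contiguous ID range:

import scala.collection.JavaConverters._

import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
import de.lmu.ifi.dbs.elki.data.NumberVector
import de.lmu.ifi.dbs.elki.data.`type`.TypeUtil
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase
import de.lmu.ifi.dbs.elki.database.ids.DBIDRange
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction
import de.lmu.ifi.dbs.elki.index.IndexFactory
import de.lmu.ifi.dbs.elki.index.tree.metrical.covertree.CoverTree

object ElkiDbscanIndexed {
  def clusterRows(data: Array[Array[Double]], eps: Double, minPts: Int): Unit = {
    // Index factory so DBSCAN's range queries use a cover tree instead of
    // a linear scan; (1.3, 10) are assumed expansion/truncation settings.
    val indexes = new java.util.ArrayList[IndexFactory[_, _]]()
    indexes.add(new CoverTree.Factory[NumberVector](
      SquaredEuclideanDistanceFunction.STATIC, 1.3, 10))

    val db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), indexes)
    db.initialize()

    val clustering = new DBSCAN[NumberVector](
      SquaredEuclideanDistanceFunction.STATIC, eps * eps, minPts).run(db)

    // ArrayAdapterDatabaseConnection assigns a contiguous DBID range, so
    // offsets map straight back to row indexes of the input array.
    val ids = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD)
      .getDBIDs.asInstanceOf[DBIDRange]
    for (clu <- clustering.getAllClusters.asScala) {
      val it = clu.getIDs.iter()
      while (it.valid()) {
        println(s"${clu.getNameAutomatic}: row ${ids.getOffset(it)}")
        it.advance()
      }
    }
  }
}

If the cast ever fails with a different data source, build an explicit DBID-to-row map while loading instead.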
