DBSCAN on Spark: which implementation

Question

I would like to do some DBSCAN on Spark. I have currently found 2 implementations:

  • https://github.com/irvingc/dbscan-on-spark
  • https://github.com/alitouka/spark_dbscan

I have tested the first one with the sbt configuration given on its GitHub page, but:

  • the functions in the jar are not the same as those in the documentation or in the source code on GitHub. For example, I could not find the train function in the jar.

I managed to run a test with the fit function (found in the jar), but a bad configuration of epsilon (a little too big) put the code into an infinite loop.

Code:

val model = DBSCAN.fit(eps, minPoints, values, parallelism)
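
For reference, the project README documents a train entry point rather than fit. The sketch below shows that documented usage as I read it in the README; the parameter values are placeholders, and (as noted above) the jar I tested did not expose train:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.dbscan.DBSCAN

// Assuming an existing SparkContext sc; parse lines such as "1.0,2.0" into vectors.
val parsedData = sc.textFile("data.txt")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Entry point as documented in the README (placeholder parameter values).
val model = DBSCAN.train(
  parsedData,
  eps = 0.3F,
  minPoints = 10,
  maxPointsPerPartition = 250)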

Has someone managed to do something with the first library?

Has someone tested the second one?

Answer

Please try ELKI. As this is Java, it should be easy to call from Scala.
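
For example, a minimal Scala sketch of that interop (assuming ELKI 0.7.x package names and toy data; the full Java version, including cluster iteration, follows below):

import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
import de.lmu.ifi.dbs.elki.data.NumberVector
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction

// Collect the data to the driver as a plain double[][] first (toy values here).
val data = Array(Array(1.0, 2.0), Array(1.1, 2.1), Array(9.0, 9.0))
val db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), null)
db.initialize()

// Squared Euclidean distance with a squared radius, as in the Java example below.
val clustering = new DBSCAN[NumberVector](
  SquaredEuclideanDistanceFunction.STATIC, 0.25 * 0.25, 2).run(db)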

ELKI is very well optimized, and with indexes it will scale to quite large data sets.

We tried to include one of these Spark implementations in our benchmarking study - but it ran out of memory (and it was the only implementation to run out of memory ... the k-means of Spark and Mahout were also among the slowest):

Hans-Peter Kriegel, Erich Schubert, and Arthur Zimek.
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
In: Knowledge and Information Systems (KAIS). 2016, 1–38

Professor Neukirchen benchmarked parallel implementations of DBSCAN in this technical report:

Helmut Neukirchen
Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms

Apparently he got some of the Spark implementations working, but noted that:

The results are devastating: none of the Apache Spark implementations comes anywhere close to the HPC implementations. In particular on larger (but still very small) data sets, most of them fail completely or do not even deliver correct results.

And earlier in the report:

When running any of the "Spark DBSCAN" implementations while making use of all available cores of our cluster, we experienced out-of-memory exceptions.

(also, "Spark DBSCAN" took 2406 seconds on 928 cores, ELKI took 997 seconds on 1 core for the smaller benchmark - the other Spark implementation didn't fare too well either, in particular it did not return the correct result...)

"Spark上的DBSCAN"没有崩溃,但返回完全错误 集群.

"DBSCAN on Spark" did not crash, but returned completely wrong clusters.

虽然"DBSCAN on Spark"完成得更快,但它可以交付 完全错误的聚类结果.由于DBSCAN的运行时间长得令人望而却步 Spark的实现已经具有最大内核数,我们没有执行 芯数较少的情况下进行测量.

While "DBSCAN on Spark" finishes faster, it delivered completely wrong clustering results. Due to the hopelessly long run-times of the DBSCAN implementations for Spark already with the maximum number of cores, we did not perform measurements with a lower number of cores.

You can wrap a double[][] array as an ELKI database:

// ELKI 0.7.x imports (package names changed in later ELKI releases).
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN;
import de.lmu.ifi.dbs.elki.data.Cluster;
import de.lmu.ifi.dbs.elki.data.Clustering;
import de.lmu.ifi.dbs.elki.data.NumberVector;
import de.lmu.ifi.dbs.elki.data.model.Model;
import de.lmu.ifi.dbs.elki.database.Database;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection;
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection;
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction;

// Adapter to load data from an existing double[][] array.
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
// Create a database (which may contain multiple relations!).
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize!).
db.initialize();

// Squared Euclidean is faster than Euclidean; square eps accordingly.
Clustering<Model> c = new DBSCAN<NumberVector>(
    SquaredEuclideanDistanceFunction.STATIC, eps * eps, minpts).run(db);

// DBSCAN produces a Clustering<Model>, so iterate its clusters as Cluster<Model>.
for(Cluster<Model> clu : c.getAllClusters()) {
  // Process each cluster here.
}

See also: the Java API example (in particular, how to map DBIDs back to row indexes). For better performance, pass an index factory (such as new CoverTree.Factory(...)) as the second parameter to the StaticArrayDatabase constructor.
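
A hedged sketch of that DBID-to-row mapping in Scala (reusing db and c from the example above, and assuming the data came from ArrayAdapterDatabaseConnection, so the DBIDs form one contiguous DBIDRange; class names per ELKI 0.7.x):

import scala.collection.JavaConverters._
import de.lmu.ifi.dbs.elki.data.NumberVector
import de.lmu.ifi.dbs.elki.data.`type`.TypeUtil
import de.lmu.ifi.dbs.elki.database.ids.DBIDRange
import de.lmu.ifi.dbs.elki.database.relation.Relation

// A static database assigns consecutive DBIDs, so offsets equal row indexes.
val rel: Relation[NumberVector] = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD)
val ids = rel.getDBIDs().asInstanceOf[DBIDRange]
for (clu <- c.getAllClusters().asScala) {
  val it = clu.getIDs().iter()
  while (it.valid()) {
    val row = ids.getOffset(it) // index into the original double[][] data
    it.advance()
  }
}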
