Spark LSH gets stuck forever at the approxSimilarityJoin() function


Problem description

I am trying to use Spark's LSH implementation (MinHashLSH) to find the nearest neighbours of each user in a very large dataset containing 50,000 rows with ~5,000 features per row. Here is the relevant code.

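    import org.apache.spark.ml.feature.MinHashLSH;
    import org.apache.spark.ml.feature.MinHashLSHModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    MinHashLSH mh = new MinHashLSH().setNumHashTables(3).setInputCol("features")
                        .setOutputCol("hashes");

    MinHashLSHModel model = mh.fit(dataset);

    // Self-join: keep row pairs whose Jaccard distance is below the configured limit.
    Dataset<Row> approxSimilarityJoin = model
            .approxSimilarityJoin(dataset, dataset, config.getJaccardLimit(), "JaccardDistance").toDF();

    // show() is the action that actually triggers the join.
    approxSimilarityJoin.show();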

The job gets stuck at the approxSimilarityJoin() call and never gets past it. Please let me know how to solve this.

Recommended answer

It will finish if you leave it running long enough, but there are some things you can do to speed it up. Reviewing the source code, you can see that the algorithm:

  1. hashes the inputs,
  2. joins the 2 datasets on the hashes,
  3. computes the Jaccard distance using a UDF, and
  4. filters the dataset with your threshold.
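
To make the four steps concrete, here is a minimal plain-Java sketch of a MinHash-style similarity join. It is not Spark's implementation: the class name `MinHashJoinSketch`, the method signature, the hash family, and the representation of each row as a set of non-zero feature indices are all assumptions made for illustration. It also returns each unordered pair once, which a Spark self-join would not necessarily do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

public class MinHashJoinSketch {
    // Modulus for the hash functions (an arbitrary large prime chosen for this sketch).
    private static final long PRIME = 2038074743L;

    /** Jaccard distance = 1 - |A ∩ B| / |A ∪ B|; assumes both sets are non-empty. */
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        long intersection = a.stream().filter(b::contains).count();
        long union = a.size() + b.size() - intersection;
        return 1.0 - (double) intersection / union;
    }

    /** Rows map an id to its non-negative, non-zero feature indices. */
    static Map<String, Double> approxSimilarityJoin(
            Map<String, Set<Integer>> rows, int numHashTables, double threshold, long seed) {
        Random rnd = new Random(seed);

        // Step 1: hash every row once per hash table (min over all feature hashes).
        // Rows are grouped by (table, hash) so that step 2 becomes a bucket lookup.
        Map<String, List<String>> buckets = new HashMap<>();
        for (int t = 0; t < numHashTables; t++) {
            long a = 1 + rnd.nextInt((int) PRIME - 1);
            long b = rnd.nextInt((int) PRIME - 1);
            for (Map.Entry<String, Set<Integer>> row : rows.entrySet()) {
                long minHash = Long.MAX_VALUE;
                for (int feature : row.getValue()) {
                    minHash = Math.min(minHash, ((1L + feature) * a + b) % PRIME);
                }
                buckets.computeIfAbsent(t + ":" + minHash, k -> new ArrayList<>())
                       .add(row.getKey());
            }
        }

        // Step 2: "join on the hashes": rows sharing a bucket become candidate pairs.
        // A bucket with n rows yields n * (n - 1) / 2 candidates, which is where the cost grows.
        Map<String, Double> result = new TreeMap<>();
        for (List<String> ids : buckets.values()) {
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    String x = ids.get(i), y = ids.get(j);
                    String key = x.compareTo(y) < 0 ? x + "|" + y : y + "|" + x;
                    if (result.containsKey(key)) continue;
                    // Step 3: compute the exact Jaccard distance for the candidate pair.
                    double distance = jaccardDistance(rows.get(x), rows.get(y));
                    // Step 4: keep only pairs below the threshold.
                    if (distance < threshold) result.put(key, distance);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> rows = new HashMap<>();
        rows.put("a", new HashSet<>(Arrays.asList(1, 2, 3)));
        rows.put("b", new HashSet<>(Arrays.asList(1, 2, 3)));
        rows.put("c", new HashSet<>(Arrays.asList(100, 200)));
        System.out.println(approxSimilarityJoin(rows, 3, 0.5, 42L));
    }
}
```

In this sketch, rows with identical feature sets always land in the same buckets, while dissimilar rows that happen to collide are discarded in step 4. Large buckets produce many candidate pairs, which is consistent with the join being the expensive part.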

The join is probably the slow part here, because it shuffles the data. So here are some things to try:

  1. Change the partitioning of your input DataFrame.
  2. Change spark.sql.shuffle.partitions (the default gives you 200 partitions after a join).
  3. Your dataset looks small enough that you could use spark.sql.functions.broadcast(dataset) for a map-side join.
  4. Are these vectors sparse or dense? The algorithm works better with sparse vectors.
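
As a rough illustration of how the options above might be applied from Java, here is a sketch that needs Spark on the classpath. The class and method names (`LshJoinTuningSketch`, `tunedJoin`, `exampleSparseRow`) and the partition count of 64 are hypothetical placeholders, not recommended values; whether the broadcast hint is honoured depends on your Spark version and on the size of the data, so check the physical plan with `explain()`.

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LshJoinTuningSketch {

    // Hypothetical partition count; tune it for your cluster and data size.
    private static final int PARTITIONS = 64;

    static Dataset<Row> tunedJoin(SparkSession spark, MinHashLSHModel model,
                                  Dataset<Row> dataset, double threshold) {
        // Option 2: change the number of partitions produced by shuffles (default 200).
        spark.conf().set("spark.sql.shuffle.partitions", String.valueOf(PARTITIONS));

        // Option 1: change the partitioning of the input DataFrame.
        Dataset<Row> repartitioned = dataset.repartition(PARTITIONS);

        // Option 3: hint that one side is small enough to broadcast (map-side join).
        return model
                .approxSimilarityJoin(repartitioned, broadcast(dataset), threshold, "JaccardDistance")
                .toDF();
    }

    // Option 4: a sparse vector stores only the non-zero entries.
    // Here: 5000 dimensions, with indices 3 and 17 set to 1.0.
    static Vector exampleSparseRow() {
        return Vectors.sparse(5000, new int[] {3, 17}, new double[] {1.0, 1.0});
    }
}
```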

Of these four options, 2 and 3 have worked best for me, always in combination with sparse vectors.
