MongoDB Spark Connector - aggregation is slow

Problem description

I am running the same aggregation pipeline with a Spark application and on the Mongos console. On the console, the data is fetched in the blink of an eye, and only a second use of "it" is needed to retrieve all the expected data. According to the Spark WebUI, however, the Spark application takes almost two minutes.

As you can see, 242 tasks are being launched to fetch the result. I am not sure why such a high number of tasks is launched when the MongoDB aggregation only returns 40 documents. It looks like there is a lot of overhead.

The query I run on the Mongos console:

db.data.aggregate([
   {
      $match:{
         signals:{
            $elemMatch:{
               signal:"SomeSignal",
               value:{
                  $gt:0,
                  $lte:100
               }
            }
         }
      }
   },
   {
      $group:{
         _id:"$root_document",
         firstTimestamp:{
            $min:"$ts"
         },
         lastTimestamp:{
            $max:"$ts"
         },
         count:{
            $sum:1
         }
      }
   }
])

The Spark Application code

    // Load the collection as an RDD and push the aggregation pipeline down to MongoDB
    JavaMongoRDD<Document> rdd = MongoSpark.load(sc);

    JavaMongoRDD<Document> aggregatedRdd = rdd.withPipeline(Arrays.asList(
            Document.parse(
                    "{ $match: { signals: { $elemMatch: { signal: \"SomeSignal\", value: { $gt: 0, $lte: 100 } } } } }"),
            Document.parse(
                    "{ $group : { _id : \"$root_document\", firstTimestamp: { $min: \"$ts\"}, lastTimestamp: { $max: \"$ts\"} , count: { $sum: 1 } } }")));

    // Format each aggregated document as one semicolon-separated output line
    JavaRDD<String> outputRdd = aggregatedRdd.map(new Function<Document, String>() {
        @Override
        public String call(Document arg0) throws Exception {
            String output = String.format("%s;%s;%s;%s", arg0.get("_id").toString(),
                    arg0.get("firstTimestamp").toString(), arg0.get("lastTimestamp").toString(),
                    arg0.get("count").toString());
            return output;
        }
    });

    outputRdd.saveAsTextFile("/user/spark/output");

After that, I use hdfs dfs -getmerge /user/spark/output/ output.csv and compare the results.

Why is the aggregation so slow? Isn't the call to withPipeline meant to reduce the amount of data that needs to be transferred to Spark? It looks like it isn't doing the same aggregation the Mongos console does. On the Mongos console it is blazing fast. I am using Spark 1.6.1 and mongo-spark-connector_2.10 version 1.1.0.

Another thing I am wondering about is that two executors get launched (because I am using the default execution settings at the moment), but only one of them does all the work. Why isn't the second executor doing any work?

Edit 2: When using a different aggregation pipeline and calling .count() instead of saveAsTextFile(..), 242 tasks are also created. This time 65,000 documents are returned.

Answer

The high number of tasks is caused by the default Mongo Spark partitioner strategy. It ignores the aggregation pipeline when calculating the partitions, for two main reasons:

  1. It reduces the cost of calculating the partitions
  2. It ensures the same behaviour for sharded and non-sharded partitioners

However, as you've found, they can generate empty partitions, which in your case is costly.
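
One way to see this is to count the documents per partition on the Spark side. A minimal sketch, assuming the aggregatedRdd from the question and the Spark 1.6 Java API (where FlatMapFunction.call returns an Iterable):

    // Needs java.util.{Collections, Iterator, List} and
    // org.apache.spark.api.java.function.FlatMapFunction in addition to the question's imports.
    // Counts how many documents land in each partition created by the default partitioner;
    // most of the 242 partitions should turn out to be empty.
    List<Integer> docsPerPartition = aggregatedRdd.mapPartitions(
            new FlatMapFunction<Iterator<Document>, Integer>() {
                @Override
                public Iterable<Integer> call(Iterator<Document> docs) {
                    int count = 0;
                    while (docs.hasNext()) {
                        docs.next();
                        count++;
                    }
                    return Collections.singletonList(count);
                }
            }).collect();

    int nonEmpty = 0;
    for (int count : docsPerPartition) {
        if (count > 0) {
            nonEmpty++;
        }
    }
    System.out.println(docsPerPartition.size() + " partitions, " + nonEmpty + " non-empty");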

The options for fixing this could be:

  1. Change partitioning strategy

Choose an alternative partitioner to reduce the number of partitions. For example, the PaginateByCount partitioner will split the database into a set number of partitions (a configuration sketch is shown further below).

Create your own partitioner - simply implement the trait and you will be able to apply the aggregation pipeline and partition up the results. See the HalfwayPartitioner and custom partitioner test for an example.

  2. Pre-aggregate the results into a collection using $out and read from there.
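
For the $out route, a minimal sketch using the plain MongoDB Java driver (3.x), run once before starting the Spark job; the host, database name, and the preAggregated target collection name are placeholders:

    // Needs com.mongodb.MongoClient (MongoDB Java driver).
    // Runs the pipeline once on the server and writes the small result set into a new
    // collection via $out ("preAggregated" is a placeholder name).
    MongoClient client = new MongoClient("mongoHost", 27017);
    client.getDatabase("myDb").getCollection("data").aggregate(Arrays.asList(
            Document.parse(
                    "{ $match: { signals: { $elemMatch: { signal: \"SomeSignal\", value: { $gt: 0, $lte: 100 } } } } }"),
            Document.parse(
                    "{ $group : { _id : \"$root_document\", firstTimestamp: { $min: \"$ts\"}, lastTimestamp: { $max: \"$ts\"} , count: { $sum: 1 } } }"),
            Document.parse("{ $out: \"preAggregated\" }")))
            .first(); // iterating the result cursor is what triggers the $out stage
    client.close();

The Spark job can then point its input collection at preAggregated and call MongoSpark.load(sc) without any pipeline, so every partition only reads from the already aggregated collection.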

A custom partitioner should produce the best solution but there are ways to make better use of the available default partitioners.
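
As an example of making better use of the built-in partitioners, here is a sketch of selecting the paginate-by-count strategy through the SparkConf. The property names spark.mongodb.input.partitioner and spark.mongodb.input.partitionerOptions.numberOfPartitions and the MongoPaginateByCountPartitioner name should be checked against the configuration docs of the connector version in use; the URI, app name, and partition count are placeholders:

    // Needs org.apache.spark.SparkConf and org.apache.spark.api.java.JavaSparkContext.
    // Selects a partitioner that splits the collection into a fixed number of partitions
    // instead of the 242 produced by the default strategy.
    SparkConf conf = new SparkConf()
            .setAppName("MongoAggregation") // placeholder app name
            .set("spark.mongodb.input.uri", "mongodb://mongoHost:27017/myDb.data")
            .set("spark.mongodb.input.partitioner", "MongoPaginateByCountPartitioner")
            .set("spark.mongodb.input.partitionerOptions.numberOfPartitions", "8");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // The rest of the job stays the same: load, withPipeline, map, save.
    JavaMongoRDD<Document> rdd = MongoSpark.load(sc);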

If you think there should be a default partitioner that uses the aggregation pipeline to calculate the partitions then please add a ticket to the MongoDB Spark Jira project.
