Running Spark app on EMR is slow

Problem description

I am new to Spark and MapReduce, and I have a problem running Spark on an Elastic MapReduce (EMR) AWS cluster. The problem is that running on EMR takes a lot of time.

For example, I have a few million records in a .csv file that I read and convert into a JavaRDD. With Spark, it took 104.99 seconds to compute simple mapToDouble() and sum() operations on this dataset.

When I did the same calculation without Spark, using Java 8 and converting the .csv file to a List, it took only 0.5 seconds (see the code below).

This is the Spark code (104.99 seconds):

    private double getTotalUnits(JavaRDD<DataObject> dataCollection)
    {
        if (dataCollection.count() > 0)
        {
            return dataCollection
                    .mapToDouble(data -> data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }
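
Incidentally, count() launches a full Spark job of its own, and sum() then recomputes the RDD from the source unless it is cached, so this method reads the input twice. A minimal sketch of a variant that avoids this (the helper name getTotalUnitsCached is mine; cache() and isEmpty() are standard JavaRDD methods):

    private double getTotalUnitsCached(JavaRDD<DataObject> dataCollection)
    {
        // Keep partitions in memory once computed, so the emptiness check
        // and the sum together read the source data only once.
        dataCollection.cache();
        if (dataCollection.isEmpty())   // cheaper than count() > 0
        {
            return 0.0;
        }
        return dataCollection
                .mapToDouble(data -> data.getQuantity())
                .sum();
    }

In fact, sum() already returns 0.0 for an empty RDD, so the guard could be dropped entirely, saving one job.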

And this is the same Java code without using Spark (0.5 seconds):

    private double getTotalOps(List<DataObject> dataCollection)
    {
        if (dataCollection.size() > 0)
        {
            return dataCollection
                    .stream()
                    .mapToDouble(data -> data.getPrice() * data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }

I'm new to EMR and Spark, so I don't know what I should do to fix this problem.

UPDATE: This is a single example of the function. My whole task is to calculate different statistics (sum, mean, median) and perform different transformations on 6 GB of data. That is why I decided to use Spark. The whole app with 6 GB of data takes about 3 minutes to run using regular Java and 18 minutes using Spark and MapReduce.
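
As an aside, several of the statistics mentioned above can be obtained from Spark in a single pass via JavaDoubleRDD.stats(); a short sketch, assuming the same dataCollection as in the question (StatCounter does not include the median, which needs a separate approach):

    import org.apache.spark.api.java.JavaDoubleRDD;
    import org.apache.spark.util.StatCounter;

    // One pass computes count, sum, mean, stdev, min and max together,
    // instead of one full job per statistic.
    JavaDoubleRDD quantities =
            dataCollection.mapToDouble(data -> data.getQuantity());
    StatCounter stats = quantities.stats();
    double sum  = stats.sum();
    double mean = stats.mean();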

Answer

I believe you are comparing oranges to apples. You must understand when to use big data tools versus a normal Java program.

Big data is not for small amounts of data. A big-data framework needs to perform multiple management tasks in a distributed environment, which is significant overhead. For small data, the actual processing time can be tiny relative to the time spent managing the whole process on the Hadoop platform. Hence a standalone program is bound to perform better than big-data tools like MapReduce or Spark.
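
For what it's worth, one way to strip the cluster-scheduling part of that overhead out of a small-data experiment is to run Spark in local mode on a single machine; a minimal, illustrative setup (the app name is arbitrary):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // "local[*]" runs Spark inside this JVM using all available cores,
    // so no EMR/YARN scheduling or shipping of tasks across the cluster
    // is involved.
    SparkConf conf = new SparkConf()
            .setAppName("small-data-test")   // arbitrary name
            .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);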

If you wish to see the difference, make sure to process at least 1 TB of data through the two programs above and compare the time taken to process it.

Apart from the above point, big data brings fault tolerance to processing. Think about it: what would happen to normal Java program execution if the JVM crashed (say, with an OutOfMemoryError)? In a normal Java program, the whole process simply collapses. On a big-data platform, the framework ensures that processing is not halted and that failure recovery/retries take place. This makes it fault tolerant, and you do not lose the work done on other parts of the data just because of a crash.
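
To make that concrete, Spark's per-task retry behavior is a configuration knob: spark.task.maxFailures (default 4) is the real setting, and the value below is only an example.

    import org.apache.spark.SparkConf;

    // A task that dies (e.g. its executor JVM hits OutOfMemoryError) is
    // re-scheduled on another executor up to this many times before the
    // job as a whole is failed; work on other partitions is kept.
    SparkConf conf = new SparkConf()
            .setAppName("fault-tolerance-demo")  // arbitrary name
            .set("spark.task.maxFailures", "8");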

The table below roughly explains when you should switch to big data.
