Running Spark app on EMR is slow

Problem description

I am new to Spark and MapReduce, and I have a problem running Spark on an Elastic MapReduce (EMR) AWS cluster. The problem is that running on EMR takes a lot of time.

For example, I have a few million records in a .csv file that I read and convert into a JavaRDD. With Spark, it took 104.99 seconds to compute simple mapToDouble() and sum() operations on this dataset.

When I did the same calculation without Spark, using Java 8 and converting the .csv file to a List, it took only 0.5 seconds (see the code below).

This is the Spark code (104.99 seconds):

    private double getTotalUnits(JavaRDD<DataObject> dataCollection)
    {
        if (dataCollection.count() > 0)
        {
            return dataCollection
                    .mapToDouble(data -> data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }
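
Incidentally, count() launches a full Spark job of its own, and sum() then recomputes the RDD from the source unless it is cached, so this method reads the input twice. A minimal sketch of a variant that avoids this (the helper name getTotalUnitsCached is mine; cache() and isEmpty() are standard JavaRDD methods):

    private double getTotalUnitsCached(JavaRDD<DataObject> dataCollection)
    {
        // Keep partitions in memory once computed, so the emptiness check
        // and the sum together read the source data only once.
        dataCollection.cache();
        if (dataCollection.isEmpty())   // cheaper than count() > 0
        {
            return 0.0;
        }
        return dataCollection
                .mapToDouble(data -> data.getQuantity())
                .sum();
    }

In fact, sum() already returns 0.0 for an empty RDD, so the guard could be dropped entirely, saving one job.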

And this is the same Java code without using Spark (0.5 seconds):

    private double getTotalOps(List<DataObject> dataCollection)
    {
        if (dataCollection.size() > 0)
        {
            return dataCollection
                    .stream()
                    .mapToDouble(data -> data.getPrice() * data.getQuantity())
                    .sum();
        }
        else
        {
            return 0.0;
        }
    }

I'm new to EMR and Spark, so I don't know what I should do to fix this problem.

UPDATE: This is a single example of the function. My whole task is to calculate different statistics (sum, mean, median) and perform different transformations on 6 GB of data. That is why I decided to use Spark. The whole app with 6 GB of data takes about 3 minutes to run using regular Java and 18 minutes using Spark and MapReduce.
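
As an aside, several of the statistics mentioned above can be obtained from Spark in a single pass via JavaDoubleRDD.stats(); a short sketch, assuming the same dataCollection as in the question (StatCounter does not include the median, which needs a separate approach):

    import org.apache.spark.api.java.JavaDoubleRDD;
    import org.apache.spark.util.StatCounter;

    // One pass computes count, sum, mean, stdev, min and max together,
    // instead of one full job per statistic.
    JavaDoubleRDD quantities =
            dataCollection.mapToDouble(data -> data.getQuantity());
    StatCounter stats = quantities.stats();
    double sum  = stats.sum();
    double mean = stats.mean();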

Answer

I believe you are comparing oranges to apples. You must understand when to use big data tools versus a normal Java program.

Big data is not for small amounts of data. A big-data framework needs to perform multiple management tasks in a distributed environment, which is significant overhead. For small data, the actual processing time can be tiny relative to the time spent managing the whole process on the Hadoop platform. Hence a standalone program is bound to perform better than big-data tools like MapReduce or Spark.
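
For what it's worth, one way to strip the cluster-scheduling part of that overhead out of a small-data experiment is to run Spark in local mode on a single machine; a minimal, illustrative setup (the app name is arbitrary):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // "local[*]" runs Spark inside this JVM using all available cores,
    // so no EMR/YARN scheduling or shipping of tasks across the cluster
    // is involved.
    SparkConf conf = new SparkConf()
            .setAppName("small-data-test")   // arbitrary name
            .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);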

If you wish to see the difference, make sure to process at least 1 TB of data through the two programs above and compare the time taken to process it.

Apart from the above point, big data brings fault tolerance to processing. Think about it: what would happen to normal Java program execution if the JVM crashed (say, with an OutOfMemoryError)? In a normal Java program, the whole process simply collapses. On a big-data platform, the framework ensures that processing is not halted and that failure recovery/retries take place. This makes it fault tolerant, and you do not lose the work done on other parts of the data just because of a crash.
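
To make that concrete, Spark's per-task retry behavior is a configuration knob: spark.task.maxFailures (default 4) is the real setting, and the value below is only an example.

    import org.apache.spark.SparkConf;

    // A task that dies (e.g. its executor JVM hits OutOfMemoryError) is
    // re-scheduled on another executor up to this many times before the
    // job as a whole is failed; work on other partitions is kept.
    SparkConf conf = new SparkConf()
            .setAppName("fault-tolerance-demo")  // arbitrary name
            .set("spark.task.maxFailures", "8");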

The table below roughly explains when you should switch to big data.
