Spark query running very slow


Problem description

I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark version 1.4. I'm benchmarking the performance of Spark on 4 million rows of data coming from Redshift. I fired one query through the pyspark shell:

    df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
    df.registerTempTable('test')
    d = sqlContext.sql("""
        select user_id from (
            select -- (i1)
                sum(total),
                user_id
            from
                (select -- (i2)
                    avg(total) as total,
                    user_id
                from
                    test
                group by
                    order_id,
                    user_id) as a
            group by
                user_id
            having sum(total) > 0
        ) as b
    """)
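To make the logic of the query concrete, here is a minimal sketch that runs the same SQL against an in-memory SQLite table instead of Spark. The sample rows and the helper name `users_with_positive_totals` are assumptions for illustration only; they are not from the original post, and SQLite stands in for the Redshift table purely to show what the nested aggregation returns.

```python
import sqlite3

# Hypothetical sample rows (order_id, user_id, total); not from the original post.
rows = [
    (1, 10, 5.0),
    (1, 10, 7.0),   # same order as the row above: avg(total) for order 1 is 6.0
    (2, 10, -2.0),
    (3, 20, -4.0),
    (4, 30, 0.0),
]

# Same query as in the question, with the (i1)/(i2) comments dropped.
QUERY = """
select user_id from (
    select sum(total), user_id
    from (
        select avg(total) as total, user_id
        from test
        group by order_id, user_id
    ) as a
    group by user_id
    having sum(total) > 0
) as b
"""


def users_with_positive_totals(sample_rows):
    """Load the rows into an in-memory table named 'test' and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table test (order_id integer, user_id integer, total real)"
        )
        conn.executemany("insert into test values (?, ?, ?)", sample_rows)
        return sorted(r[0] for r in conn.execute(QUERY))
    finally:
        conn.close()


result = users_with_positive_totals(rows)
print(result)
```

The inner query averages `total` per (order_id, user_id) pair, the middle query sums those averages per user, and only users whose sum is positive survive. Each of the two `group by` levels requires its own aggregation pass over the data.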

When I do d.count(), the above query takes 30 seconds when df is not cached and 17 seconds when df is cached in memory.
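A small timing helper along these lines could be used to compare the cached and uncached runs. This is only a sketch: `time_action` is a hypothetical name, it assumes Python 3 for `time.perf_counter`, and the pyspark calls appear only in comments because they need a running cluster.

```python
import time


def time_action(action, repeats=3):
    """Call a zero-argument function several times.

    Returns the last result and the list of elapsed times in seconds.
    """
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        timings.append(time.perf_counter() - start)
    return result, timings


# In the pyspark shell this might be used as follows (not run here):
#   df.cache()                                  # caching is lazy
#   count, timings = time_action(d.count)
# The first run would include loading df into the cache;
# later runs would read the cached data from memory.

# Stand-in workload so the helper can be tried without a cluster.
count, timings = time_action(lambda: sum(range(1000)))
print(count, timings)
```

Repeating the action several times separates the one-off cost of populating the cache from the steady-state time of the query itself.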

I'm expecting these timings to be closer to 1-2s.

These are my Spark configurations:

    spark.executor.memory 6154m
    spark.driver.memory 3g
    spark.shuffle.spill false
    spark.default.parallelism 8

The rest is set to its default values. Can anyone see what I'm missing here?

Solution

This is normal; don't expect Spark to run in a few milliseconds like MySQL or Postgres do. Spark is low latency compared to other big data solutions like Hive or Impala, but you cannot compare it with a classic database: Spark is not a database where data is indexed!

Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk

They clearly put Spark here:

Did you try Apache Drill? I found it a bit faster (I use it for small HDFS JSON files of 2-3 GB, where it is much faster than Spark for SQL queries).
