My Spark SQL LIMIT is very slow


Problem Description

I use Spark to read from Elasticsearch, like:

select col from index limit 10;

The problem is that the index is very large: it contains 100 billion rows, and Spark generates thousands of tasks to finish the job. All I need is 10 rows; even a single task returning 10 rows could finish the job, so I don't need that many tasks. LIMIT is very slow, even LIMIT 1.

Code:

// register the query result as a temp view (quotes added so this compiles)
val sql = "select col from index limit 10"
sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable)

Recommended Answer

The source code of limit shows that it takes the first limit elements of every partition, and then it scans all partitions.
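As a side note (not part of the original answer): if you work with the DataFrame directly rather than a temp view, take(n) may return a handful of rows faster than collecting a LIMIT query, because it runs incremental jobs over a growing number of partitions instead of planning the whole scan up front. A minimal sketch in Scala, assuming spark is an active SparkSession and the index is queryable as a table or view named index:

// Sketch: fetch 10 rows incrementally instead of scanning everything.
// Assumes `spark` is an active SparkSession and the Elasticsearch index
// is registered as a table or view named `index`.
val df = spark.sql("select col from index")

// take(n) starts by scanning one partition and scales up the number of
// partitions tried (controlled by spark.sql.limit.scaleUpFactor) until it
// has n rows, so it usually touches only a few partitions.
val rows = df.take(10)
rows.foreach(println)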

To speed up the query, you can specify one value of the partition key. Suppose you are using day as the partition key; then the following query will be much faster:

select col from index where day = '2018-07-10' limit 10;
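For reference, a sketch of how that filtered query might slot into the asker's original snippet; sqlExecListener and tempTable are the asker's own identifiers, and '2018-07-10' is just the example partition value from above:

// Sketch using the asker's identifiers from the question.
// The partition filter (day = '2018-07-10') lets the source prune the scan
// to a single partition before LIMIT is applied.
val sql = "select col from index where day = '2018-07-10' limit 10"
sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable)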
