my spark sql limit is very slow


Problem Description

I use Spark to read from Elasticsearch, like:

select col from index limit 10;

The problem is that the index is very large: it contains 100 billion rows, and Spark generates thousands of tasks to finish the job. All I need is 10 rows; even a single task returning 10 rows could finish the job. I don't need so many tasks. Limit is very slow, even limit 1.

Code:

val sql = "select col from index limit 10"
sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable)

Recommended Answer

The source code of limit shows that it takes the first limit elements for every partition, and then it scans all partitions.
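
You can see this per-partition behavior in the physical plan. A minimal sketch, assuming an existing SparkSession named spark with the Elasticsearch index registered as a table called index (both are assumptions, not the asker's exact setup):

// Minimal sketch: assumes an existing SparkSession named spark and the
// Elasticsearch index registered as a table called index.
val df = spark.sql("select col from index limit 10")

// Depending on the plan, explain() prints either a CollectLimit or a
// GlobalLimit over per-partition LocalLimits; in both cases the limit
// is first applied inside each partition, which is why all partitions
// may still be scanned.
df.explain()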

To speed up the query, you can specify one value of the partition key. Suppose you are using day as the partition key; the following query will be much faster:

select col from index where day = '2018-07-10' limit 10;
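
Applied to the code from the question, this looks like the sketch below. sqlExecListener and tempTable come from the question itself; the day column and its value are the answer's assumption:

val sql = "select col from index where day = '2018-07-10' limit 10"
sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable)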
