My Spark SQL limit is very slow
Problem description
I use Spark to read data from Elasticsearch, like:
select col from index limit 10;
The problem is that the index is very large (it contains 100 billion rows), and Spark generates thousands of tasks to finish the job. All I need is 10 rows; even a single task returning 10 rows could finish the job, so I don't need that many tasks. The limit is very slow, even `limit 1`.

Code:
val sql = "select col from index limit 10"
sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable)
Recommended answer
The source code of `limit` shows that it takes the first `limit` elements from every partition, so it still scans all partitions.
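This behavior can be illustrated with a conceptual sketch (plain Python, not Spark internals): each partition contributes up to `n` rows, and then the first `n` overall are kept, so every partition may still be touched even though only 10 rows survive. The function and data names here are illustrative only.

```python
def limit_query(partitions, n):
    """Simulate `limit n` over a partitioned dataset.

    Takes up to n rows from each partition (the per-partition "local limit"),
    then keeps the first n overall (the "global limit").
    Returns the resulting rows and the number of partitions scanned.
    """
    collected = []
    scanned = 0
    for part in partitions:
        scanned += 1                 # a task is scheduled for this partition
        collected.extend(part[:n])   # local limit: first n rows of the partition
    return collected[:n], scanned    # global limit: first n rows overall

# 1,000 partitions with 100 rows each; rows are (partition_id, row_id) pairs
partitions = [[(p, i) for i in range(100)] for p in range(1000)]
rows, scanned = limit_query(partitions, 10)
print(len(rows), scanned)
```

Only 10 rows come back, yet all 1,000 partitions were scanned, which is why `limit 1` over a 100-billion-row index still launches thousands of tasks. (Real Spark versions may stop scheduling once enough rows are collected, but the worst case matches this sketch.)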
To speed up the query, you can filter on a specific value of the partition key. Suppose you use `day` as the partition key; the following query will be much faster:
select col from index where day = '2018-07-10' limit 10;
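The reason this helps, sketched below under the assumption that the data is physically partitioned by `day`: an equality predicate on the partition key lets the engine prune every partition except the matching one before the limit is applied, so only a single partition is scanned. The names (`partitions_by_day`, the dates) are illustrative, not a real API.

```python
def filtered_limit(partitions_by_day, day, n):
    """Simulate partition pruning: scan only the partition(s) for `day`,
    then take the first n rows."""
    matching = partitions_by_day.get(day, [])
    scanned = 1 if day in partitions_by_day else 0
    return matching[:n], scanned

# Two day-partitions of 100 rows each; rows are (day, row_id) pairs
partitions_by_day = {
    "2018-07-09": [("2018-07-09", i) for i in range(100)],
    "2018-07-10": [("2018-07-10", i) for i in range(100)],
}
rows, scanned = filtered_limit(partitions_by_day, "2018-07-10", 10)
print(len(rows), scanned)
```

With the predicate in place the limit only has to read one partition instead of all of them, which is what makes the filtered query so much faster on a huge index.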