spark access first n rows - take vs limit
Question
I want to access the first 100 rows of a Spark data frame and write the result back to a CSV file.
Why is take(100) basically instant, whereas
df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
takes forever? I do not want to obtain the first 100 records per partition, just any 100 records.
Why is take() so much faster than limit()?
Answer
This is because predicate pushdown is currently not supported in Spark; see this very good answer.
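If the goal is just to write any 100 rows quickly, a common workaround is to collect them to the driver with take and rebuild a small DataFrame from the result. This is only a sketch: it assumes a SparkSession named spark and the questioner's DataFrame df, and it is only viable when n is small enough to fit in driver memory.

import org.apache.spark.sql.{SaveMode, SparkSession}

// Collect just 100 rows to the driver; take scans as few partitions as possible.
val first100 = df.take(100)  // Array[Row]

// Rebuild a tiny DataFrame from those rows, reusing the original schema.
val small = spark.createDataFrame(
  spark.sparkContext.parallelize(first100.toSeq),
  df.schema
)

// Writing a DataFrame this small is cheap, so no repartition is needed.
small.write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")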
Actually, take(n) should take a really long time as well. However, I just tested it and got the same results as you: take is almost instantaneous regardless of dataset size, while limit takes a long time.