spark access first n rows - take vs limit


Question


I want to access the first 100 rows of a spark data frame and write the result back to a CSV file.


Why is take(100) basically instant, whereas

df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")


takes forever. I do not want to obtain the first 100 records per partition but just any 100 records.


Why is take() so much faster than limit()?

Answer


This is because predicate pushdown is currently not supported in Spark, see this very good answer.


Actually, take(n) should take a really long time as well. I just tested it, however, and got the same results as you do - take is almost instantaneous regardless of dataset size, while limit takes a lot of time.
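One common workaround, when the rows comfortably fit on the driver, is to combine both approaches: pull the rows to the driver with take, rebuild a small DataFrame from them, and write that. A minimal sketch, assuming an existing SparkSession named `spark` and the same `df`, path, and CSV options as in the question:

```scala
import org.apache.spark.sql.SaveMode

// take(100) scans only as many partitions as needed, so it returns quickly.
val first100 = df.take(100) // Array[Row] collected on the driver

// Rebuild a tiny DataFrame from those rows, reusing the original schema.
val small = spark.createDataFrame(
  spark.sparkContext.parallelize(first100.toIndexedSeq),
  df.schema)

// Writing 100 rows from a single partition is cheap.
small
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
```

This trades a round-trip through the driver for avoiding the expensive limit-then-write plan, so it is only appropriate for small n.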

