spark access first n rows - take vs limit


Problem Description

I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

df.limit(100)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .option("header", true)
      .option("delimiter", ";")
      .csv("myPath")

takes forever. I do not want to obtain the first 100 records of every partition, just any 100 records.

Recommended Answer

This is because predicate pushdown is currently not supported in Spark, see this very good answer.

Actually, take(n) should also take a long time. However, I just tested it and got the same results as you: take is almost instantaneous regardless of database size, while limit takes a lot of time.
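If the goal is simply to get any 100 rows into a single CSV, one common workaround is to let take(100) collect the rows on the driver and then rebuild a small DataFrame before writing. This is a sketch, not part of the original answer; it assumes an existing SparkSession named spark and the DataFrame df from the question, and it is only reasonable because 100 rows easily fit in driver memory:

```scala
import org.apache.spark.sql.SaveMode

// take(100) launches incremental jobs and stops as soon as it has 100 rows,
// instead of evaluating limit across the whole plan.
val first100 = df.take(100) // Array[Row] on the driver

// Rebuild a tiny DataFrame from the collected rows, reusing the original schema.
val small = spark.createDataFrame(
  spark.sparkContext.parallelize(first100.toSeq),
  df.schema
)

// A single small partition, so the write produces one CSV part file.
small.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
```

This trades a driver-side collect for the slow distributed limit, so it only makes sense for small n.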

