Performing operations only on a subset of an RDD

Views: 60 Published: 2021/4/8 19:39:11 Tag: apache-spark


Question

I would like to perform some transformations on only a subset of an RDD (to make experimenting in the REPL faster).

Is that possible?

RDD has a take(num: Int): Array[T] method. I think I need something similar, but returning an RDD[T].

Recommended answer

You can use RDD.sample, which returns an RDD rather than an Array. For example, to sample ~1% without replacement:

val data = ...
data.count
...
res1: Long = 18066983

// Arguments: withReplacement = false, fraction = 0.01 (about 1%), seed taken from the clock
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
...
res3: Long = 180190

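The third parameter is a seed, which thankfully becomes optional in the next Spark version. Note that the fraction is an expected proportion rather than an exact count, which is why the sample above holds 180190 elements instead of exactly 1% of 18066983.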
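As a rough, self-contained sketch of the approach in this answer, the snippet below wraps RDD.sample in a small helper and runs it on a local SparkContext. The object name SubsetOfRdd, the helpers sampleSubset and firstN, the app name, the local[2] master, and the toy data are all assumptions made for illustration; none of them come from the original answer. firstN is not part of the answer either: it is one possible way to get "take, but as an RDD" by collecting the first n elements to the driver and redistributing them, which is only sensible for small n.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object SubsetOfRdd {
  // Hypothetical helper: an RDD holding roughly `fraction` of the input,
  // sampled without replacement. A fixed seed makes the sample repeatable.
  def sampleSubset[T](data: RDD[T], fraction: Double, seed: Long): RDD[T] =
    data.sample(withReplacement = false, fraction = fraction, seed = seed)

  // Hypothetical helper: an RDD with at most n elements. take(n) pulls the
  // elements to the driver, so keep n small; parallelize turns them back
  // into an RDD.
  def firstN[T: ClassTag](sc: SparkContext, data: RDD[T], n: Int): RDD[T] =
    sc.parallelize(data.take(n).toSeq)

  def main(args: Array[String]): Unit = {
    // Assumed local setup for experimenting outside the REPL.
    val conf = new SparkConf().setAppName("subset-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 100000)

      // About 1% of the data; the exact count varies with the seed.
      val sample = sampleSubset(data, 0.01, seed = 42L)
      println(s"sampled ${sample.count()} of ${data.count()} elements")

      // Exactly the first 1000 elements, as an RDD.
      val head = firstN(sc, data, 1000)
      println(s"first-N subset has ${head.count()} elements")
    } finally {
      sc.stop()
    }
  }
}
```

In the Spark shell, where sc already exists, only the two helper definitions are needed. Both helpers return an RDD, so further transformations can be chained on the smaller dataset just as on the full one.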
