How to do stratified sampling with Spark DataFrames?

Views: 461 | Published: 2020/9/4 3:39:57 | Tags: apache-spark, apache-spark-mllib


Question

I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That is targeted for Spark 1.5. Until it lands, what is the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards, MK

Recommended answer

Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without any MLlib dependency.
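To make the difference between the two routines concrete, here is a minimal sketch on a plain key-value RDD that uses only Spark Core. The data, the fractions, the seed, the local master setting, and the object and method names are all assumptions chosen for illustration; they are not from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleByKeyDemo {
  // Returns per-key counts for the approximate and the exact sample.
  def sampleCounts(sc: SparkContext): (scala.collection.Map[String, Long], scala.collection.Map[String, Long]) = {
    // Hypothetical data: 1000 records in stratum "a", 100 in stratum "b"
    val data = sc.parallelize(
      (1 to 1000).map(i => ("a", i)) ++ (1 to 100).map(i => ("b", i)))

    // Desired sampling fraction for each key (stratum)
    val fractions = Map("a" -> 0.1, "b" -> 0.5)

    // Single pass over the data; per-key sample sizes only approximate fraction * count
    val approx = data.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)

    // Additional passes over the data; per-key sample sizes match
    // ceil(fraction * count) with high confidence
    val exact = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)

    (approx.countByKey(), exact.countByKey())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sampleByKey-demo").setMaster("local[2]"))
    try {
      val (approxCounts, exactCounts) = sampleCounts(sc)
      println(s"sampleByKey counts:      $approxCounts")
      println(s"sampleByKeyExact counts: $exactCounts")
    } finally {
      sc.stop()
    }
  }
}
```

sampleByKeyExact costs more than sampleByKey because it needs extra passes over the data, so the approximate version is usually the one to reach for unless the per-stratum sizes must be precise.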

These two functions are defined in PairRDDFunctions and operate on key-value RDDs of type RDD[(K, T)]. DataFrames, on the other hand, do not have keys, so you have to go through the underlying RDD, something like this:

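val df = ... // your DataFrame
// Fraction desired from each key; K is the type of the key column
val fractions: Map[K, Double] = ...

// Key each Row by its first column, then sample per key without replacement.
// getAs[K] keeps the key type consistent with the keys of `fractions`.
val sample = df.rdd.keyBy(x => x.getAs[K](0)).sampleByKey(false, fractions)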

Note that sample is now an RDD of (key, Row) pairs, not a DataFrame, but you can easily convert it back: drop the keys with sample.values and rebuild the DataFrame using the schema you already have in df.schema.
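The round trip from DataFrame to keyed RDD and back could look roughly like the sketch below. It assumes the stratum is a string in the first column and uses the SQLContext entry point from the Spark 1.3 era; the column names, data, fractions, and the helper name sampleByFirstColumn are hypothetical. As far as I know, every key that occurs in the data needs an entry in fractions, otherwise the lookup fails at runtime.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StratifiedDataFrameSample {
  // Hypothetical helper: stratify on the string in column 0 and
  // return the sample as a DataFrame with the original schema.
  def sampleByFirstColumn(sqlContext: SQLContext, df: DataFrame,
                          fractions: Map[String, Double], seed: Long): DataFrame = {
    val sampledRows = df.rdd
      .keyBy(row => row.getString(0)) // RDD[(String, Row)]
      .sampleByKey(withReplacement = false, fractions = fractions, seed = seed)
      .values                         // drop the keys again: RDD[Row]
    sqlContext.createDataFrame(sampledRows, df.schema)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stratified-df").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    try {
      // Hypothetical input: 1000 rows labelled "a", 100 rows labelled "b"
      val schema = StructType(Seq(
        StructField("label", StringType, nullable = false),
        StructField("value", IntegerType, nullable = false)))
      val rows = sc.parallelize(
        (1 to 1000).map(i => Row("a", i)) ++ (1 to 100).map(i => Row("b", i)))
      val df = sqlContext.createDataFrame(rows, schema)

      val sampled = sampleByFirstColumn(sqlContext, df, Map("a" -> 0.1, "b" -> 0.5), seed = 42L)
      sampled.groupBy("label").count().show()
    } finally {
      sc.stop()
    }
  }
}
```

On Spark 1.5 and later, the work tracked in SPARK-7157 is available as df.stat.sampleBy(col, fractions, seed), which does approximate stratified sampling directly on a DataFrame, so the RDD detour is mainly needed on older versions or when sampleByKeyExact is required.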
