How to do stratified sampling with Spark DataFrames?

Question

I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157), but that is targeted at Spark 1.5. Until it lands, what's the easiest way to get the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards, MK

Recommended answer

Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without an MLlib dependency.
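
As a rough illustration of how the two routines differ, here is a minimal sketch on a plain pair RDD. The object name, the local master setting, the toy data and the fractions are all made up for this example; only sampleByKey and sampleByKeyExact come from Spark itself.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddSampling {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration only
    val conf = new SparkConf().setMaster("local[2]").setAppName("pair-rdd-sampling")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical data: 1000 records under key "a", 100 under key "b"
      val data = sc.parallelize(
        (1 to 1000).map(i => ("a", i)) ++ (1 to 100).map(i => ("b", i)))

      // Desired sampling fraction per key; give every key in the data an entry
      val fractions = Map("a" -> 0.1, "b" -> 0.5)

      // Approximate: a single pass, per-key sample sizes vary around fraction * count
      val approx = data.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)

      // Exact: additional passes, aiming for ceil(fraction * count) records per key
      val exact = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)

      // Print the per-key sample sizes of both variants
      println(approx.countByKey())
      println(exact.countByKey())
    } finally {
      sc.stop()
    }
  }
}
```

sampleByKeyExact costs more than sampleByKey, so the approximate variant is usually the first thing to try unless the per-stratum sizes have to be guaranteed.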

Both functions are defined in PairRDDFunctions and operate on key-value RDDs of type RDD[(K, T)]. DataFrames have no notion of keys, so you have to go through the underlying RDD, something like this:

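val df = ... // your DataFrame
// Sampling fraction desired for each key; K is the type of the key column
val fractions: Map[K, Double] = ...

// Key each Row by its first column (cast to K so the key type matches the map),
// then sample each key's rows without replacement
val sample = df.rdd.keyBy(row => row.getAs[K](0)).sampleByKey(withReplacement = false, fractions)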

Note that sample is now an RDD of (key, Row) pairs, not a DataFrame. You can easily convert it back to a DataFrame by dropping the keys and reusing the schema already defined for df.
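
The full round trip might look something like the sketch below. It assumes a string-typed key column; the helper stratifiedSample, the column names and the toy data are hypothetical and only there to make the example runnable. SQLContext is used to match the Spark 1.3 setup from the question; later versions would typically start from a SparkSession instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StratifiedDataFrameSample {
  // Hypothetical helper: samples df per distinct value of the string column keyCol
  def stratifiedSample(df: DataFrame, keyCol: String,
                       fractions: Map[String, Double], seed: Long): DataFrame = {
    val keyIndex = df.schema.fieldNames.indexOf(keyCol)
    require(keyIndex >= 0, s"Column $keyCol not found")
    val sampled = df.rdd
      .keyBy(row => row.getString(keyIndex))
      .sampleByKey(withReplacement = false, fractions = fractions, seed = seed)
      .values // drop the keys again, leaving an RDD[Row]
    // Reuse the original schema to turn the sampled rows back into a DataFrame
    df.sqlContext.createDataFrame(sampled, df.schema)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration only
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("stratified-df"))
    val sqlContext = new SQLContext(sc)
    try {
      // Hypothetical schema and data: 1000 rows in group "a", 100 in group "b"
      val schema = StructType(Seq(
        StructField("group", StringType, nullable = false),
        StructField("value", IntegerType, nullable = false)))
      val rows = sc.parallelize(
        (1 to 1000).map(i => Row("a", i)) ++ (1 to 100).map(i => Row("b", i)))
      val df = sqlContext.createDataFrame(rows, schema)

      val sample = stratifiedSample(df, "group", Map("a" -> 0.1, "b" -> 0.5), seed = 42L)
      // Show how many rows survived in each stratum
      sample.groupBy("group").count().show()
    } finally {
      sc.stop()
    }
  }
}
```

From Spark 1.5 onward, the feature tracked in SPARK-7157 is available directly on DataFrames as df.stat.sampleBy(col, fractions, seed), which makes the detour through the RDD unnecessary for the approximate case.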
