How to do stratified sampling with Spark DataFrames?

Question

I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5. Until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards, MK

Recommended answer

Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without MLlib dependencies.

These two functions are defined in PairRDDFunctions and apply to key-value RDDs of type RDD[(K, T)]. DataFrames, however, do not have keys, so you have to go through the underlying RDD, something like this:

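val df = ... // your DataFrame
// Desired sampling fraction for each key. sampleByKey treats these as
// approximate rates. With keyBy(x => x(0)) on Rows, K is Any.
val fractions: Map[K, Double] = ...

// Key each Row by its first column, then sample without replacement.
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(false, fractions)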

Note that sample is now an RDD of (key, Row) pairs rather than a DataFrame, but you can easily convert it back to a DataFrame by dropping the keys and reusing the schema already defined for df.
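As a rough illustration of the full round trip (DataFrame to keyed RDD, sampleByKey, back to DataFrame), here is a minimal sketch against the Spark 1.3-era SQLContext API. The helper name stratifiedSample, the toy columns label and value, the fractions, and the seed are all made up for the example and are not part of the original answer. It assumes the stratum is in column 0 and that every key present in the data has an entry in the fractions map.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object StratifiedSampleSketch {

  // Hypothetical helper (not a Spark API): samples df per stratum,
  // treating column 0 as the stratum key.
  def stratifiedSample(sqlContext: SQLContext, df: DataFrame,
                       fractions: Map[Any, Double], seed: Long): DataFrame = {
    // Row values are untyped, so keyBy yields an RDD[(Any, Row)].
    val sampledPairs = df.rdd
      .keyBy(row => row(0))
      .sampleByKey(withReplacement = false, fractions = fractions, seed = seed)
    // Drop the keys and reuse the original schema to rebuild a DataFrame.
    sqlContext.createDataFrame(sampledPairs.values, df.schema)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stratified-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Toy data, assumed for illustration: 250 rows labelled "a", 750 labelled "b".
    val schema = StructType(Seq(
      StructField("label", StringType, nullable = false),
      StructField("value", IntegerType, nullable = false)))
    val rows = sc.parallelize((1 to 1000).map(i => Row(if (i % 4 == 0) "a" else "b", i)))
    val df = sqlContext.createDataFrame(rows, schema)

    // Illustrative per-stratum sampling rates.
    val fractions: Map[Any, Double] = Map("a" -> 0.5, "b" -> 0.1)

    val sampledDf = stratifiedSample(sqlContext, df, fractions, seed = 42L)
    sampledDf.groupBy("label").count().show()

    sc.stop()
  }
}
```

Since sampleByKey is approximate, the per-label counts will only be close to the requested fractions; swapping in sampleByKeyExact trades extra passes over the data for exact sample sizes. On Spark 1.5 or later, df.stat.sampleBy(col, fractions, seed) provides approximate stratified sampling directly on a DataFrame, which makes the RDD detour unnecessary.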