How to do Stratified sampling with Spark DataFrames?
Question
I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157). That's targeted for Spark 1.5. Until that comes through, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards MK
Recommended answer
Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they are available without MLlib dependencies.
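To make the difference between the two routines concrete, here is a minimal sketch on a plain pair RDD. The object name PairRddSampling, the keys "a" and "b", the fractions and the seed are all made up for illustration, and a local master is assumed. sampleByKey makes a single pass and only approximates the requested fraction per key, while sampleByKeyExact does extra work to land on the requested per-key sample size.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; needed before Spark 1.3

object PairRddSampling {
  // Returns per-key counts for the approximate and the exact sampler.
  def sampleCounts(sc: SparkContext)
      : (scala.collection.Map[String, Long], scala.collection.Map[String, Long]) = {
    // Hypothetical data: 1000 records under key "a", 1000 under key "b".
    val data = sc.parallelize(
      (1 to 1000).map(i => ("a", i)) ++ (1 to 1000).map(i => ("b", i)))

    // Sampling fraction per key (illustrative values).
    val fractions = Map("a" -> 0.1, "b" -> 0.5)

    // One pass; the sample size per key is only approximately fraction * count.
    val approx = data.sampleByKey(withReplacement = false, fractions = fractions, seed = 42L)
    // Additional passes; targets the requested sample size per key.
    val exact = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 42L)

    (approx.countByKey(), exact.countByKey())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pair-rdd-sampling").setMaster("local[2]"))
    val (approx, exact) = sampleCounts(sc)
    println(s"sampleByKey counts:      $approx")
    println(s"sampleByKeyExact counts: $exact")
    sc.stop()
  }
}
```

Comparing the two printed maps shows how far the approximate sampler drifts from the requested fractions on a given run.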
These two functions are defined in PairRDDFunctions and apply to key-value RDDs of type RDD[(K, T)]. DataFrames, however, do not have keys, so you'd have to use the underlying RDD, something like below:
val df = ... // your DataFrame
val fractions: Map[K, Double] = ... // sampling fraction desired for each key
val sample = df.rdd.keyBy(x => x(0)).sampleByKey(false, fractions) // key by first column; false = without replacement
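The snippet above leaves df and fractions open. A self-contained version might look like the sketch below, which assumes the Spark 1.3-style SQLContext API and a string key in the first column. The object name, the column names "label" and "value", the helper sampleRows and the fractions are hypothetical. As far as I know, every key that occurs in the data needs an entry in fractions, otherwise the lookup fails at runtime.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

object DataFrameStratifiedSample {
  // Keys each Row by the string column at keyIndex, then samples per key.
  // Assumption: that column is a non-null string.
  def sampleRows(
      df: DataFrame,
      keyIndex: Int,
      fractions: Map[String, Double],
      seed: Long): RDD[(String, Row)] =
    df.rdd
      .keyBy(row => row.getString(keyIndex))
      .sampleByKey(withReplacement = false, fractions = fractions, seed = seed)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("df-stratified").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical data: 250 "rare" rows and 750 "common" rows.
    val df = sc.parallelize(
      (1 to 1000).map(i => (if (i % 4 == 0) "rare" else "common", i))
    ).toDF("label", "value")

    // Keep about half of the rare rows and about a tenth of the common ones.
    val sample = sampleRows(df, 0, Map("rare" -> 0.5, "common" -> 0.1), seed = 7L)
    println(sample.countByKey())
    sc.stop()
  }
}
```

Swapping sampleByKey for sampleByKeyExact in sampleRows gives the exact variant at the cost of extra passes over the data.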
Note that sample is now an RDD, not a DataFrame. Because of keyBy, its elements are (key, Row) pairs, but you can easily convert it back to a DataFrame by dropping the keys, since you already have the schema defined for df.
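The conversion back could be sketched as follows, assuming a SQLContext is in scope as in Spark 1.3. The object SampleToDataFrame and the helper toDataFrame are names invented here. The helper drops the keys added by keyBy and reuses the original DataFrame's schema through createDataFrame.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

object SampleToDataFrame {
  // Drops the sampling keys and rebuilds a DataFrame with the original schema.
  def toDataFrame[K](
      sqlContext: SQLContext,
      sample: RDD[(K, Row)],
      original: DataFrame): DataFrame =
    sqlContext.createDataFrame(sample.map(_._2), original.schema)
}
```

With the names from the answer, a call would be SampleToDataFrame.toDataFrame(sqlContext, sample, df), and the result has the same columns as df.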