Stratified sampling in Spark
Problem description
I have a data set which contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third is a boolean flag.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to take exactly 80% of each user's data and build one RDD, and take the remaining 20% and build another RDD. Let's call them train and test. I would like to stay away from groupBy, since the data set is large and groupBy can create memory problems. What's the best way to do this?
I could do the following, but it will not give 80% of each user:
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
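To see why the per-element coin flip drifts away from an exact 80% per user, here is a small plain-Scala sketch with no Spark involved; the seed, userId, and row count are illustrative assumptions:

```scala
import scala.util.Random

val rng  = new Random(42)                         // fixed seed, illustrative only
val rows = (1 to 10).map(i => (2147481832, i))    // one user with 10 purchases
// Same idea as percentData above: keep a row when a random draw lands below 80
val kept = rows.filter(_ => rng.nextInt(100) < 80)
println(s"kept ${kept.size} of ${rows.size}")     // around 8, but nothing forces exactly 8
```

Each row is kept independently with probability 0.8, so for a user with few rows the realized fraction can easily be 60% or 100% rather than 80%.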
Solution
One possibility is in Holden's answer, and this is another one:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class, which is still experimental for now (Spark 1.4.1)
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
::Experimental:: Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it:
Consider the following list:
val list = List((2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1))
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(list.toSeq).map(x => (x._1,(x._2,x._3)))
I'll set up the fractions for each key as follows, since, as you've noticed, the fractions argument of sampleByKeyExact takes a Map of fractions for each key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct ones, associate each key with a fraction equal to 0.8, and then collect the whole thing as a Map.
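Assuming plain Scala collections stand in for the RDD, the same fractions map can be sketched without a cluster (toMap plays the role of collectAsMap here):

```scala
// Every distinct userId maps to the same sampling fraction, 0.8,
// mirroring data.map(_._1).distinct.map(x => (x, 0.8)).collectAsMap
val list = List((2147481832, 23355149, 1), (2147481832, 973010692, 1),
                (214748183, 919187908, 1), (214748183, 91187908, 1))
val fractions: Map[Int, Double] =
  list.map(_._1).distinct.map(k => (k, 0.8)).toMap
println(fractions)  // both distinct userIds map to 0.8
```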
To sample now, all I have to do is :
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 2L)
You can check the count on your keys, on the data, or on the data sample:
scala> data.count
[...]
res10: Long = 12

scala> sampleData.count
[...]
res11: Long = 10
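To get the held-out 20% (the test set the question asks for), one option, assuming the standard RDD API, is to subtract the sample from the full data, e.g. `val test = data.subtract(sampleData)`. The idea in plain Scala, with hypothetical lists standing in for the RDDs:

```scala
val data  = List((1, (10, 1)), (1, (11, 1)), (1, (12, 0)), (2, (20, 1)), (2, (21, 0)))
val train = List((1, (10, 1)), (1, (11, 1)), (2, (20, 1)))  // hypothetical 80% sample
val test  = data diff train                                  // complement, like RDD.subtract
println(test)  // the two rows not in train
```

Note that subtract compares whole (key, value) pairs, so it only works as a complement when the pairs are distinct; duplicate rows would need a different approach.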