Stratified sampling in Spark


Problem Description

I have a data set which contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third indicates a boolean flag.

(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
... 

I want to make sure I only take 80% of each user's data to build one RDD, while taking the remaining 20% to build another RDD. Let's call them train and test. I would like to stay away from using groupBy to start with, since the data set is large and it can create memory problems. What's the best way to do this?

I could do the following, but it will not give exactly 80% of each user's data:

// Tag each record with a random integer in [0, 100) and keep the ~80% tagged below 80.
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
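
To see why, here is a minimal check (a sketch, assuming data and train as above): computing each user's actual kept fraction shows that a per-record coin flip only hovers around 0.8 per user, rather than hitting it exactly.

// Hypothetical per-user check: fraction of each user's records that landed in train.
val totalPerUser = data.map(x => (x._1, 1L)).reduceByKey(_ + _)
val keptPerUser = train.map(x => (x._1, 1L)).reduceByKey(_ + _)
val keptFraction = totalPerUser.join(keptPerUser)
  .mapValues { case (total, kept) => kept.toDouble / total }
keptFraction.collect().foreach(println) // values scatter around 0.8, not exactly 0.8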

Solution

One possibility is in Holden's answer, and this is another one:

You can use the sampleByKeyExact transformation from the PairRDDFunctions class (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions), which is still experimental for now (Spark 1.4.1):

sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)

::Experimental:: Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).

And this is how I would do it:

Consider the following list:

val list = List((2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1))

I would create a pair RDD, mapping all the users as keys:

val data = sc.parallelize(list.toSeq).map(x => (x._1,(x._2,x._3)))

I'll set up the fractions for each key as follows, since, as you've noticed, the fractions argument of sampleByKeyExact takes a Map with a fraction for each key:

val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap

What I have done here is actually map over the keys to find the distinct ones, then associate each key with a fraction equal to 0.8, and finally collect the whole thing as a Map.
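
As an illustration (the REPL output below is hypothetical; the Map ordering and res number will vary), the example list has two distinct keys, so fractions comes out as:

scala> fractions
res0: scala.collection.Map[Int,Double] = Map(2147481832 -> 0.8, 214748183 -> 0.8)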

To sample now, all I have to do is:

import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false,fractions,2L)

or

val sampleData = data.sampleByKeyExact(withReplacement = false,fractions = fractions,seed = 2L)
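
The question also asks for the remaining 20% as a test set. A minimal sketch for that, assuming each (key, value) record is unique (subtract removes every matching copy, so duplicate records would be lost):

val train = sampleData
val test = data.subtract(sampleData) // the ~20% not picked by the sampler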

You can check the counts on your keys, the data, or the data sample:

scala> data.count
[...]
res10: Long = 12

scala> sampleData.count
[...]
res11: Long = 10
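
These counts follow from the per-stratum arithmetic: user 2147481832 has ten records and user 214748183 has two, so sampleByKeyExact keeps math.ceil(10 * 0.8) = 8 and math.ceil(2 * 0.8) = 2, i.e. 10 in total. You can confirm the per-key breakdown with countByKey (the output below is illustrative):

scala> sampleData.countByKey
res12: scala.collection.Map[Int,Long] = Map(2147481832 -> 8, 214748183 -> 2)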
