Stratified sampling with Spark and Java


Problem description

I'd like to make sure I'm training on a stratified sample of my data.

It seems this is supported by Spark 2.1 and earlier versions via JavaPairRDD.sampleByKey(...) and JavaPairRDD.sampleByKeyExact(...) as explained here.
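
For context, here is a minimal sketch of what those two calls look like on a JavaPairRDD of (label, row) pairs. The helper name, the 0.0/1.0 labels, and the 0.8 fractions are illustrative assumptions, not part of the original question:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Row;

import java.util.HashMap;
import java.util.Map;

class SamplingSketch {
    static JavaPairRDD<Double, Row> stratify(JavaPairRDD<Double, Row> byLabel, boolean exact) {
        // Per-key sampling fractions: keep ~80% of each class (assumed labels)
        Map<Double, Double> fractions = new HashMap<>();
        fractions.put(0.0, 0.8);
        fractions.put(1.0, 0.8);
        return exact
                // additional passes over the data, but exact per-key sample sizes
                ? byLabel.sampleByKeyExact(false, fractions, 42L)
                // single pass; per-key fractions are honored only in expectation
                : byLabel.sampleByKey(false, fractions, 42L);
    }
}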

But: My data is stored in a Dataset<Row>, not a JavaPairRDD. The first column is the label, all others are features (imported from a libsvm-formatted file).

What's the easiest way to get a stratified sample of my dataset instance and at the end have a Dataset<Row> again?

Somehow, this question is related to Dealing with unbalanced datasets in Spark MLlib.

This possible duplicate does not mention Dataset<Row> at all, nor is it in Java. It does not answer my question.

Recommended answer

Ok, since the answer to the question linked here was actually not written for Java, I have rewritten it in Java.

The reasoning is still the same, though: we are still using sampleByKeyExact. There is no out-of-the-box miracle feature for now (Spark 2.1.0).

So there you go:

package org.awesomespark.examples;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.*;
import scala.Tuple2;

import java.util.Map;

public class StratifiedDatasets {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Stratified Datasets")
                .getOrCreate();

        // Load the libsvm file: column 0 is the label, column 1 the feature vector
        Dataset<Row> data = spark.read().format("libsvm").load("sample_libsvm_data.txt");

        // Key each row by its label so we can sample per class
        JavaPairRDD<Double, Row> rdd = data.toJavaRDD().keyBy(x -> x.getDouble(0));

        // Build the per-label sampling fractions: keep 80% of every class
        Map<Double, Double> fractions = rdd.map(Tuple2::_1)
                .distinct()
                .mapToPair((PairFunction<Double, Double, Double>) (Double x) -> new Tuple2<>(x, 0.8))
                .collectAsMap();

        // Exact stratified sample (without replacement, fixed seed), then drop the keys
        JavaRDD<Row> sampledRDD = rdd.sampleByKeyExact(false, fractions, 2L).values();

        // Rebuild a Dataset<Row> with the original schema
        Dataset<Row> sampledData = spark.createDataFrame(sampledRDD, data.schema());

        sampledData.show();
        sampledData.printSchema();
    }
}
}
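
A note on the design: the fractions map assigns 0.8 to every distinct label, so roughly 80% of each class is retained. Because sampleByKeyExact makes additional passes over the RDD so that the per-key sample sizes match the requested fractions exactly, it is more expensive than sampleByKey but gives a deterministic class balance; the fixed seed (2L) just makes the run reproducible.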

Now let's package and submit:

$ sbt package
[...]
// [success] Total time: 2 s, completed Jan 16, 2017 1:45:51 PM

$ spark-submit --class org.awesomespark.examples.StratifiedDatasets target/scala-2.10/java-stratified-dataset_2.10-1.0.jar 
[...]

// +-----+--------------------+
// |label|            features|
// +-----+--------------------+
// |  0.0|(692,[127,128,129...|
// |  1.0|(692,[158,159,160...|
// |  1.0|(692,[124,125,126...|
// |  1.0|(692,[152,153,154...|
// |  1.0|(692,[151,152,153...|
// |  0.0|(692,[129,130,131...|
// |  1.0|(692,[99,100,101,...|
// |  0.0|(692,[154,155,156...|
// |  0.0|(692,[127,128,129...|
// |  1.0|(692,[154,155,156...|
// |  0.0|(692,[151,152,153...|
// |  1.0|(692,[129,130,131...|
// |  0.0|(692,[154,155,156...|
// |  1.0|(692,[150,151,152...|
// |  0.0|(692,[124,125,126...|
// |  0.0|(692,[152,153,154...|
// |  1.0|(692,[97,98,99,12...|
// |  1.0|(692,[124,125,126...|
// |  1.0|(692,[156,157,158...|
// |  1.0|(692,[127,128,129...|
// +-----+--------------------+
// only showing top 20 rows

// root
//  |-- label: double (nullable = true)
//  |-- features: vector (nullable = true)
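
As a side note (not part of the original answer): if exact per-class counts are not required, Spark also exposes a Dataset-level helper, DataFrameStatFunctions.sampleBy, which avoids the RDD round trip entirely and returns a Dataset<Row> directly. Like sampleByKey, it honors the fractions only in expectation. A minimal sketch, assuming the same libsvm input and 0.0/1.0 labels:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.HashMap;
import java.util.Map;

class SampleBySketch {
    static Dataset<Row> stratify(SparkSession spark, String path) {
        Dataset<Row> data = spark.read().format("libsvm").load(path);

        // Assumed labels; adjust to the actual classes in the data
        Map<Double, Double> fractions = new HashMap<>();
        fractions.put(0.0, 0.8);
        fractions.put(1.0, 0.8);

        // Stays a Dataset<Row> end to end; fractions hold only in expectation
        return data.stat().sampleBy("label", fractions, 2L);
    }
}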

For Python users, you can also check my answer Stratified sampling with pyspark.
