Stratified sampling with Spark and Java


Question


I'd like to make sure I'm training on a stratified sample of my data.


It seems this is supported by Spark 2.1 and earlier versions via JavaPairRDD.sampleByKey(...) and JavaPairRDD.sampleByKeyExact(...) as explained here.
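To make concrete what `sampleByKeyExact` promises beyond plain `sampleByKey`: each stratum (key) contributes exactly `ceil(fraction * strataSize)` rows, whereas `sampleByKey` draws per-row Bernoulli trials and only hits the fraction in expectation. The following plain-Java sketch (no Spark involved; the class and method names are hypothetical, for illustration only) mimics that exact per-stratum guarantee on an in-memory map of strata:

```java
import java.util.*;
import java.util.stream.*;

public class ExactStratifiedSample {
    // Illustrative only: per-key exact sampling over in-memory lists,
    // mimicking the guarantee of JavaPairRDD.sampleByKeyExact.
    // Each stratum contributes exactly ceil(fraction * strataSize) rows.
    static <K, V> Map<K, List<V>> sampleByKeyExact(Map<K, List<V>> strata,
                                                   Map<K, Double> fractions,
                                                   long seed) {
        Random rng = new Random(seed);
        Map<K, List<V>> sample = new HashMap<>();
        for (Map.Entry<K, List<V>> e : strata.entrySet()) {
            List<V> rows = new ArrayList<>(e.getValue());
            Collections.shuffle(rows, rng);  // random order within the stratum
            int n = (int) Math.ceil(fractions.get(e.getKey()) * rows.size());
            sample.put(e.getKey(), new ArrayList<>(rows.subList(0, n)));
        }
        return sample;
    }

    public static void main(String[] args) {
        // Two strata: label 0.0 with 10 rows, label 1.0 with 5 rows.
        Map<Double, List<Integer>> strata = new HashMap<>();
        strata.put(0.0, IntStream.range(0, 10).boxed().collect(Collectors.toList()));
        strata.put(1.0, IntStream.range(0, 5).boxed().collect(Collectors.toList()));
        Map<Double, Double> fractions = new HashMap<>();
        fractions.put(0.0, 0.8);
        fractions.put(1.0, 0.8);

        Map<Double, List<Integer>> s = sampleByKeyExact(strata, fractions, 2L);
        // Exactly ceil(0.8*10)=8 and ceil(0.8*5)=4 rows per stratum.
        System.out.println(s.get(0.0).size() + " " + s.get(1.0).size());
        // prints "8 4"
    }
}
```

Spark does the same thing distributed, with extra passes over the RDD to hit the exact counts, which is why `sampleByKeyExact` is more expensive than `sampleByKey`.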


But: My data is stored in a Dataset<Row>, not a JavaPairRDD. The first column is the label, all others are features (imported from a libsvm-formatted file).


What's the easiest way to get a stratified sample of my dataset instance and at the end have a Dataset<Row> again?

To some extent, this question is related to dealing with unbalanced datasets in Spark MLlib.

This possible duplicate does not mention Dataset<Row> at all, nor is it in Java. It does not answer my question.

Answer

Ok, since the answer to the question here was actually not intended for Java, I have rewritten it in Java.

The reasoning is still the same. We are still using sampleByKeyExact. There is no out-of-the-box miracle feature for now (Spark 2.1.0).

Here you go:

package org.awesomespark.examples;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.*;
import scala.Tuple2;

import java.util.Map;

public class StratifiedDatasets {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Stratified Datasets")
                .getOrCreate();

        Dataset<Row> data = spark.read().format("libsvm").load("sample_libsvm_data.txt");

        // Key each row by its label (column 0) so we can sample per class.
        JavaPairRDD<Double, Row> rdd = data.toJavaRDD().keyBy(x -> x.getDouble(0));

        // Build the per-label sampling fractions: keep 80% of each class.
        Map<Double, Double> fractions = rdd.map(Tuple2::_1)
                .distinct()
                .mapToPair((PairFunction<Double, Double, Double>) (Double x) -> new Tuple2<>(x, 0.8))
                .collectAsMap();

        // Exact per-key sampling, without replacement, seed 2.
        JavaRDD<Row> sampledRDD = rdd.sampleByKeyExact(false, fractions, 2L).values();

        // Back to a Dataset<Row>, reusing the original schema.
        Dataset<Row> sampledData = spark.createDataFrame(sampledRDD, data.schema());

        sampledData.show();
        sampledData.printSchema();
    }
}

Now let's package and submit:

$ sbt package
[...]
// [success] Total time: 2 s, completed Jan 16, 2017 1:45:51 PM

$ spark-submit --class org.awesomespark.examples.StratifiedDatasets target/scala-2.10/java-stratified-dataset_2.10-1.0.jar 
[...]

// +-----+--------------------+
// |label|            features|
// +-----+--------------------+
// |  0.0|(692,[127,128,129...|
// |  1.0|(692,[158,159,160...|
// |  1.0|(692,[124,125,126...|
// |  1.0|(692,[152,153,154...|
// |  1.0|(692,[151,152,153...|
// |  0.0|(692,[129,130,131...|
// |  1.0|(692,[99,100,101,...|
// |  0.0|(692,[154,155,156...|
// |  0.0|(692,[127,128,129...|
// |  1.0|(692,[154,155,156...|
// |  0.0|(692,[151,152,153...|
// |  1.0|(692,[129,130,131...|
// |  0.0|(692,[154,155,156...|
// |  1.0|(692,[150,151,152...|
// |  0.0|(692,[124,125,126...|
// |  0.0|(692,[152,153,154...|
// |  1.0|(692,[97,98,99,12...|
// |  1.0|(692,[124,125,126...|
// |  1.0|(692,[156,157,158...|
// |  1.0|(692,[127,128,129...|
// +-----+--------------------+
// only showing top 20 rows

// root
//  |-- label: double (nullable = true)
//  |-- features: vector (nullable = true)
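As an aside: if an *approximate* stratified sample is acceptable, the Dataset API does expose `DataFrameStatFunctions.sampleBy` (available since Spark 1.5), which stays in `Dataset<Row>` with no RDD round-trip. Like `sampleByKey`, it is Bernoulli-based and not exact, so it is not a replacement for `sampleByKeyExact`, only a lighter alternative. A minimal sketch, assuming the same input file and `label` column as above (not runnable without a Spark installation):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.HashMap;
import java.util.Map;

public class ApproxStratifiedDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Approx Stratified Dataset")
                .getOrCreate();

        Dataset<Row> data = spark.read().format("libsvm").load("sample_libsvm_data.txt");

        // One entry per label value; labels missing from the map are dropped.
        Map<Double, Double> fractions = new HashMap<>();
        fractions.put(0.0, 0.8);
        fractions.put(1.0, 0.8);

        // Approximate per-stratum sampling, no RDD conversion needed.
        Dataset<Row> sampledData = data.stat().sampleBy("label", fractions, 2L);

        sampledData.show();
        spark.stop();
    }
}
```

The trade-off is the usual one: `sampleBy` makes a single pass but only hits each fraction in expectation, while `sampleByKeyExact` makes extra passes to guarantee the exact per-class counts.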

For Python users, you can also check my answer Stratified sampling with pyspark.

