Unexpected encoder behaviour when applying a flatMap operation on an Apache Spark Dataset<Row>


Problem description


I'm trying to convert a CSV string that actually contains double values into a spark-ml compatible dataset. Since I don't know beforehand how many features to expect, I decided to use a helper class "Instance" that already contains the right datatypes to be used by the classifiers and that already works as intended in some other cases:

import java.io.Serializable;
import org.apache.spark.ml.linalg.Vector;

public class Instance implements Serializable {
    private static final long serialVersionUID = 6091606543088855593L;
    private Vector indexedFeatures;
    private double indexedLabel;
    // ...getters and setters for both fields...
}
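
As far as I know, Encoders.bean expects a JavaBean, i.e. a public no-argument constructor plus getters and setters for every field. A minimal sketch of what the omitted accessors (and the two-argument constructor used in the flatMap below) might look like; the actual class may of course differ:

    // Hypothetical sketch of the elided boilerplate, not the author's actual code.
    public Instance() {}   // no-arg constructor needed by the bean encoder

    public Instance(Vector indexedFeatures, double indexedLabel) {
        this.indexedFeatures = indexedFeatures;
        this.indexedLabel = indexedLabel;
    }

    public Vector getIndexedFeatures() { return indexedFeatures; }
    public void setIndexedFeatures(Vector indexedFeatures) { this.indexedFeatures = indexedFeatures; }

    public double getIndexedLabel() { return indexedLabel; }
    public void setIndexedLabel(double indexedLabel) { this.indexedLabel = indexedLabel; }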


The part where I get the unexpected behaviour is this one:

    Encoder<Instance> encoder = Encoders.bean(Instance.class);
    System.out.println("encoder.schema()");
    encoder.schema().printTreeString();
    Dataset<Instance> dfInstance = df.select("value").as(Encoders.STRING())
              .flatMap(s -> {
                String[] splitted = s.split(",");

                // all values except the last one are features
                int length = splitted.length;
                double[] features = new double[length-1];
                for (int i=0; i<length-1; i++) {
                    features[i] = Double.parseDouble(splitted[i]);
                }

                // drop lines that do not contain at least one feature and a label
                if (length < 2) {
                    return Collections.emptyIterator();
                } else {
                    // the last value becomes the label
                    return Collections.singleton(new Instance(
                        Vectors.dense(features),
                        Double.parseDouble(splitted[length-1])
                        )).iterator();
                }
              }, encoder);

    System.out.println("dfInstance");
    dfInstance.printSchema();
    dfInstance.show(5);


And I get the following output on the console:

encoder.schema()
root
 |-- indexedFeatures: vector (nullable = true)
 |-- indexedLabel: double (nullable = false)

dfInstance
root
 |-- indexedFeatures: struct (nullable = true)
 |-- indexedLabel: double (nullable = true)

+---------------+------------+
|indexedFeatures|indexedLabel|
+---------------+------------+
|             []|         0.0|
|             []|         0.0|
|             []|         1.0|
|             []|         0.0|
|             []|         1.0|
+---------------+------------+
only showing top 5 rows


The encoder schema correctly displays the datatype of indexedFeatures as vector. But when I apply the encoder and perform the transformation, it gives me a column of type struct that contains no real objects.


I would like to understand why Spark gives me a struct type instead of the correct vector one.

Recommended answer


Actually, my answer is not an explanation of why you get a struct type. But based on the previous question, I can probably offer a workaround.


The original input is parsed with DataFrameReader's csv function, and then a VectorAssembler is used:

// let Spark parse the CSV strings and infer the column types
Dataset<Row> csv = spark.read().option("inferSchema", "true")
  .csv(inputDf.select("value").as(Encoders.STRING()));
String[] fieldNames = csv.schema().fieldNames();
// assemble all columns except the last one into the feature vector
VectorAssembler assembler = new VectorAssembler().setInputCols(
  Arrays.copyOfRange(fieldNames, 0, fieldNames.length-1))
  .setOutputCol("indexedFeatures");
// the last column becomes the label
Dataset<Row> result = assembler.transform(csv)
  .withColumn("indexedLabel", functions.col(fieldNames[fieldNames.length-1]))
  .select("indexedFeatures", "indexedLabel");
