Spark error when converting JavaRDD to DataFrame: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>


Problem description

I am using Spark 2.1.0. The following code reads a text file, converts the content to a DataFrame, and then feeds it into a Word2Vec model:

SparkSession spark = SparkSession.builder().appName("word2vector").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();

JavaRDD<List<String>> lists = lines.map(new Function<String, List<String>>() {
    public List<String> call(String line) {
        // Arrays.asList returns a java.util.Arrays$ArrayList, not a plain array
        return Arrays.asList(line.split(" "));
    }
});

JavaRDD<Row> rows = lists.map(new Function<List<String>, Row>() {
    public Row call(List<String> list) {
        return RowFactory.create(list);
    }
});

StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});

Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3); // the exception is thrown here
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(100).setMinCount(0);
Word2VecModel model = word2Vec.fit(input);
Dataset<Row> result = model.transform(input);

It throws the exception:

java.lang.RuntimeException: Error while encoding: java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

which happens at the line input.show(3), so createDataFrame() is causing the exception, because Arrays.asList() returns an Arrays$ArrayList, which is not supported here. However, the official Spark documentation has the following code:

List<Row> data = Arrays.asList(
    RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
    RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
    RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
);

StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);

which works just fine. If Arrays$ArrayList is not supported, how come this code works? The difference is that I am converting a JavaRDD<Row> to a DataFrame, while the official documentation converts a List<Row> to a DataFrame. I believe the Spark Java API has an overloaded createDataFrame() method that takes a JavaRDD<Row> and converts it to a DataFrame based on the provided schema. I am confused about why it is not working. Can anyone help?

Recommended answer

I encountered the same issue several days ago, and the only way to solve it is to use an array of arrays. Why? Here is the explanation:

ArrayType is a wrapper for Scala Arrays, which correspond one-to-one to Java arrays. A Java ArrayList is not mapped to a Scala Array by default, which is why you get the exception:

java.util.Arrays$ArrayList is not a valid external type for schema of array<string>

Hence, you would expect passing a String[] directly to work:

RowFactory.create(line.split(" "))

But since create takes a varargs Object... as input (because a row may contain a list of columns), the String[] gets interpreted as a list of String columns. That's why a two-dimensional array of String is required:

RowFactory.create(new String[][] {line.split(" ")})
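
To make the varargs interpretation concrete, here is a minimal sketch (the variable names are mine; it relies on RowFactory.create(Object... values) treating a plain String[] as the varargs array itself):

String[] words = "Hi I heard about Spark".split(" ");
Row perWord = RowFactory.create(words);                  // String[] spreads into five String columns
Row asArray = RowFactory.create(new String[][] {words}); // one array<string> column
// perWord.size() == 5, asArray.size() == 1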

That still leaves the mystery of why constructing a DataFrame from a Java List of rows works in the Spark documentation. The reason is that the SparkSession.createDataFrame overload that takes a java.util.List of rows as its first parameter performs special type checks and conversions, turning every Java Iterable (and hence ArrayList) into a Scala Array. The SparkSession.createDataFrame overload that takes a JavaRDD, however, maps the row content to the DataFrame directly.
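
As a quick illustration of that difference, here is a minimal sketch (assuming the same lines RDD and schema as above, and only suitable for small inputs, since collect() pulls everything to the driver) showing that the same Arrays.asList rows work when routed through the List<Row> overload:

List<Row> data = new ArrayList<>();
for (String line : lines.collect()) {
    data.add(RowFactory.create(Arrays.asList(line.split(" "))));
}
// createDataFrame(List<Row>, StructType) converts Java Iterables
// (including Arrays$ArrayList) to Scala Arrays, so this does not throw.
Dataset<Row> viaList = spark.createDataFrame(data, schema);
viaList.show(3);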

To wrap up, this is the correct version:

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

SparkSession spark = SparkSession.builder().master("local[*]").appName("Word2Vec").getOrCreate();
SparkContext sc = spark.sparkContext();
sc.setLogLevel("WARN");
JavaRDD<String> lines = sc.textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
    public Row call(String line) {
        // Wrap the String[] in another array so create() sees a single
        // array<string> column instead of one String column per word.
        return RowFactory.create(new String[][] {line.split(" ")});
    }
});

StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);

Hope this solves your problem.

