Spark Java:使用给定架构创建新的数据集 [英] Spark java : Creating a new Dataset with a given schema
本文介绍了Spark Java:使用给定架构创建新的数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我有这段代码在scala中运行良好:
I have this code that is working well in scala :
val schema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", TimestampType, true),
StructField("field3", DoubleType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true)
))
val df = spark.read
// some options
.schema(schema)
.load(myEndpoint)
我想用Java做类似的事情.所以我的代码如下:
I want to do something similar in Java. So my code is the following :
final StructType schema = new StructType(new StructField[] {
new StructField("field1", new StringType(), true,new Metadata()),
new StructField("field2", new TimestampType(), true,new Metadata()),
new StructField("field3", new StringType(), true,new Metadata()),
new StructField("field4", new StringType(), true,new Metadata()),
new StructField("field5", new StringType(), true,new Metadata())
});
Dataset<Row> df = spark.read()
// some options
.schema(schema)
.load(myEndpoint);
但这给我以下错误:
Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@37c5b8e8 (of class org.apache.spark.sql.types.StringType)
我的架构似乎没有什么问题,所以我真的不知道问题出在哪里.
Nothing seem wrong with my schemas so I don't really know what the problem is here.
spark.read().load(myEndpoint).printSchema();
root
|-- field5: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field1: string (nullable = true)
|-- field4: string (nullable = true)
|-- field3: string (nullable = true)
schema.printTreeString();
root
|-- field1: string (nullable = true)
|-- field2: timestamp (nullable = true)
|-- field3: string (nullable = true)
|-- field4: string (nullable = true)
|-- field5: string (nullable = true)
这是一个数据样本:
spark.read().load(myEndpoint).show(false);
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|field5 |field2 |field1 |field4 |field3 |
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:54:50|SOME_VALUE |SOME_VALUE |0.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 16:58:50|SOME_VALUE |SOME_VALUE |50.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 17:00:50|SOME_VALUE |SOME_VALUE |20.0 |
|{"fieldA":"AAA","fieldB":"BBB","fieldC":"CCC","fieldD":"DDD"} |2018-01-20 18:04:50|SOME_VALUE |SOME_VALUE |10.0 |
...
+---------------------------------------------------------------+-------------------+-------------+--------------+---------+
推荐答案
使用 Datatypes
类中的静态方法和字段,而构造函数在Spark 2.3.1中为我工作:
Using the static methods and fields from the Datatypes
class instead the constructors worked for me in Spark 2.3.1:
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("field1", DataTypes.StringType, true),
DataTypes.createStructField("field2", DataTypes.TimestampType, true),
DataTypes.createStructField("field3", DataTypes.StringType, true),
DataTypes.createStructField("field4", DataTypes.StringType, true),
DataTypes.createStructField("field5", DataTypes.StringType, true)
});
这篇关于Spark Java:使用给定架构创建新的数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文