How to create a Dataset from custom class Person?


Question

I was trying to create a Dataset in Java, so I wrote the following code:

public Dataset<Person> createDataset() {
  List<Person> list = new ArrayList<>();
  list.add(new Person("name", 10, 10.0));
  Dataset<Person> dataset = sqlContext.createDataset(list, Encoders.bean(Person.class));
  return dataset;
}

The Person class is an inner class.
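
For reference, a minimal sketch of what such an inner class might look like (the field types follow the constructor call above; the enclosing MyApp class and the omitted accessors are assumptions):

public class MyApp {  // hypothetical enclosing class
  // A non-static inner class: every instance carries a hidden reference to
  // the enclosing MyApp, which is what the bean encoder cannot resolve.
  public class Person implements java.io.Serializable {
    private String name;
    private int age;
    private double score;

    public Person(String name, int age, double score) {
      this.name = name;
      this.age = age;
      this.score = score;
    }
    // getters/setters omitted; Encoders.bean requires JavaBean accessors
  }
}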

However, Spark throws the following exception:

org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner class `....` without access to the scope that this class was defined in. Try moving this class out of its parent class.;

at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:264)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:260)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)

What is the correct way to do this?

Answer

tl;dr (Spark shell only) Define your case classes first and, once they are defined, use them. Using case classes in Spark/Scala applications should just work.

In 2.0.1 in the Spark shell, you should define case classes first and only then access them to create a Dataset.

$ ./bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
Branch master
Compiled by user jacek on 2016-10-25T04:20:04Z
Revision 483c37c581fedc64b218e294ecde1a7bb4b2af9c
Url https://github.com/apache/spark.git
Type --help for more information.

$ ./bin/spark-shell
scala> :pa
// Entering paste mode (ctrl-D to finish)

case class Person(id: Long)

Seq(Person(0)).toDS // <-- this won't work

// Exiting paste mode, now interpreting.

<console>:15: error: value toDS is not a member of Seq[Person]
       Seq(Person(0)).toDS // <-- it won't work
                      ^
scala> case class Person(id: Long)
defined class Person

scala> // the following implicit conversion *will* work

scala> Seq(Person(0)).toDS
res1: org.apache.spark.sql.Dataset[Person] = [id: bigint]

---

In commit 43ebf7a9cbd70d6af75e140a6fc91bf0ffc2b877 (Spark 2.0.0-SNAPSHOT, as of March 21st) a workaround for this issue was added.

In the Scala REPL I had to add OuterScopes.addOuterScope(this) while :paste-ing the complete snippet, as follows:

scala> :pa
// Entering paste mode (ctrl-D to finish)

import sqlContext.implicits._
case class Token(name: String, productId: Int, score: Double)
val data = Token("aaa", 100, 0.12) ::
  Token("aaa", 200, 0.29) ::
  Token("bbb", 200, 0.53) ::
  Token("bbb", 300, 0.42) :: Nil
// register the enclosing REPL scope so the encoder can resolve the inner class
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)
val ds = data.toDS
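
The question's code is Java rather than Scala, but the same two fixes carry over. Below is a sketch (not from the original answer; the SparkSession parameter stands in for the question's sqlContext): either move Person out to a static nested or top-level class, or keep it inner and register the outer instance with the same OuterScopes call used above.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class CreateDatasetExample {

  // Fix 1: declare Person as a static nested (or top-level) class, as the
  // exception message itself suggests. Encoders.bean also expects JavaBean
  // conventions: a no-arg constructor plus getters and setters.
  public static class Person implements java.io.Serializable {
    private String name;
    private int age;
    private double score;

    public Person() {}

    public Person(String name, int age, double score) {
      this.name = name;
      this.age = age;
      this.score = score;
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
    public double getScore() { return score; }
    public void setScore(double score) { this.score = score; }
  }

  public Dataset<Person> createDataset(SparkSession spark) {
    List<Person> list = Arrays.asList(new Person("name", 10, 10.0));

    // Fix 2 (only if Person must remain a non-static inner class):
    // register the enclosing instance before creating the encoder.
    // org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this);

    return spark.createDataset(list, Encoders.bean(Person.class));
  }
}

Making the class static is the simpler route, since its instances then carry no hidden reference to an enclosing object for the encoder to resolve.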
