Dataframe to Dataset which has type Any
Question
I recently moved from Spark 1.6 to Spark 2.X and I would like to move, where possible, from Dataframes to Datasets as well. I tried code like this:
case class MyClass(a : Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, because I do not know at compile time the type of the field I retrieve with x.get(0). It may be a long, string, int, etc.
However, when I try to execute code similar to the above, I get an exception:
java.lang.ClassNotFoundException: scala.Any
With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?
Recommended answer
Unless you are interested in limited and ugly workarounds like Encoders.kryo:
import org.apache.spark.sql.Encoders
case class FooBar(foo: Int, bar: Any)
spark.createDataset(
sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])
or
spark.createDataset(
sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))
you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
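For contrast, here is a minimal sketch of a case class that Spark can encode, assuming every field has a concrete type. The names Typed and TypedExample are hypothetical and only serve the illustration; the SparkSession is assumed to be supplied by the caller.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type: every field has a concrete, encodable type.
case class Typed(foo: Int, bar: String)

object TypedExample {
  // Builds a small Dataset[Typed] from a local sequence.
  def build(spark: SparkSession): Dataset[Typed] = {
    // Brings the implicit Encoder for case classes such as Typed into scope.
    import spark.implicits._
    Seq(Typed(1, "a"), Typed(2, "b")).toDS()
  }
}
```

Changing bar to Any in this sketch would leave no implicit Encoder available for the class, which is the situation described in the question.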
The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.
If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use.
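One way this could look, as a rough sketch rather than a definitive implementation: inspect the DataFrame schema at run time and convert to the matching specialized class. The names LongRecord, StringRecord, SpecializedDatasets and toTyped are hypothetical, and only two column types are handled here.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType}

// Hypothetical specialized classes, one per concrete type the column may have.
case class LongRecord(a: Long)
case class StringRecord(a: String)

object SpecializedDatasets {
  // Looks at the runtime type of `column` and picks the matching class.
  def toTyped(spark: SparkSession, df: DataFrame, column: String): Dataset[_] = {
    import spark.implicits._
    // Rename the column so it lines up with the field name `a`.
    val renamed = df.select(df(column).as("a"))
    renamed.schema("a").dataType match {
      case LongType   => renamed.as[LongRecord]
      case StringType => renamed.as[StringRecord]
      case other =>
        throw new IllegalArgumentException(s"Unsupported column type: $other")
    }
  }
}
```

The result is only known as Dataset[_] at compile time, so callers still have to match on the element type or branch earlier. Null values in a column mapped to a primitive field such as Long would also need separate handling.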