DataFrame to Dataset with a field of type Any


Question

I recently moved from Spark 1.6 to Spark 2.x and would like to move, where possible, from DataFrames to Datasets as well. I tried code like this:

case class MyClass(a : Any, ...)

val df = ...
df.map(x => MyClass(x.get(0), ...))

As you can see, MyClass has a field of type Any, because I do not know at compile time the type of the value I retrieve with x.get(0). It may be a Long, a String, an Int, etc.

However, when I try to execute code similar to the above, I get an exception:

java.lang.ClassNotFoundException: scala.Any

With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets in this case?

Recommended answer

The short answer is that you can't, unless you are interested in limited and ugly workarounds such as Encoders.kryo:

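import org.apache.spark.sql.Encoders

// Assumes a spark-shell session, where `spark` (SparkSession) and `sc` (SparkContext) are predefined.
case class FooBar(foo: Int, bar: Any)

// Serialize the whole object with Kryo; the resulting Dataset has a single binary column.
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])

// Keep `foo` as a regular Int column and serialize only `bar` with Kryo.
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))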

Apart from such workarounds, you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
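For contrast, here is a minimal sketch of the case that does work: a record whose field has a concrete type, so an implicit Encoder can be derived. TypedRecord and TypedExample are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record whose field has a concrete type instead of Any.
case class TypedRecord(a: Long)

object TypedExample {
  def build(spark: SparkSession): Dataset[TypedRecord] = {
    // Brings the implicit Encoders for case classes and primitives into scope.
    import spark.implicits._
    // spark.range yields a bigint column; rename it to match the field name.
    spark.range(3).toDF("a").as[TypedRecord]
  }
}
```

Changing the field to `a: Any` would leave Spark with no Encoder to derive for it, which is the situation described in the question.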

The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.

If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use.
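One way this run-time decision might look is sketched below, assuming the type can be read from the DataFrame schema. LongRecord, StringRecord, and SpecializedDatasets are hypothetical names, and only a few column types are handled.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.{IntegerType, LongType, StringType}

// Hypothetical specialized classes, one per concrete type the column may hold.
case class LongRecord(a: Long)
case class StringRecord(a: String)

object SpecializedDatasets {
  // Inspect the schema of the first column at run time and pick the matching class.
  def toTyped(df: DataFrame): Dataset[_] = {
    val spark = df.sparkSession
    import spark.implicits._
    val name = df.columns(0)
    df.schema(name).dataType match {
      case LongType | IntegerType =>
        // Widen integers to Long so that one class covers both types.
        df.select(df(name).cast(LongType).as("a")).as[LongRecord]
      case StringType =>
        df.select(df(name).as("a")).as[StringRecord]
      case other =>
        throw new IllegalArgumentException(s"Unsupported column type: $other")
    }
  }
}
```

The result is only known as `Dataset[_]` at compile time, so downstream code still has to branch on the concrete class to benefit from the typed API.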
