Dataframe to Dataset which has type Any

Question

I recently moved from Spark 1.6 to Spark 2.X, and I would like to move, where possible, from Dataframes to Datasets as well. I tried code like this:

case class MyClass(a : Any, ...)

val df = ...
df.map(x => MyClass(x.get(0), ...))

As you can see, MyClass has a field of type Any, because I do not know at compile time the type of the field I retrieve with x.get(0). It may be a long, a string, an int, etc.

However, when I try to execute code similar to what you see above, I get an exception:

java.lang.ClassNotFoundException: scala.Any

With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?

Recommended answer

Unless you're interested in limited and ugly workarounds like Encoders.kryo:

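// Assumes a spark-shell session, where `spark` and `sc` are predefined.
import org.apache.spark.sql.Encoders

case class FooBar(foo: Int, bar: Any)

// Workaround 1: serialize the whole object with Kryo (stored as a single binary column).
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])

// Workaround 2: keep `foo` as a regular Int column and serialize only `bar` with Kryo.
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))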

you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
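For contrast, here is a minimal sketch of the case that does work: every field has a concrete type, so the implicit Encoder can be derived. The names TypedFooBar and TypedFooBarExample are hypothetical and not part of the original question or answer.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical variant of FooBar in which `bar` has a concrete type instead of Any.
case class TypedFooBar(foo: Int, bar: String)

object TypedFooBarExample {
  // Builds a small Dataset; no explicit Encoder argument is needed here.
  def build(spark: SparkSession): Dataset[TypedFooBar] = {
    import spark.implicits._ // brings the implicit case-class Encoder into scope
    Seq(TypedFooBar(1, "a")).toDS()
  }
}
```

Because both fields have known types, the resulting Dataset has ordinary `foo` and `bar` columns rather than a single Kryo-serialized binary column.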

The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.

If, for a given execution, all values are of the same type, you can of course create specialized classes and decide at run time which one to use.
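One way this could look, as a rough sketch only: inspect the DataFrame schema at run time and convert to whichever specialized class matches. LongRecord, StringRecord, and SpecializedDatasets are hypothetical names, and the sketch assumes that only the first column matters and that it is either a long or a string.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType}

// Hypothetical specialized classes, one per type the first column may have.
case class LongRecord(a: Long)
case class StringRecord(a: String)

object SpecializedDatasets {
  // Picks the specialized class from the runtime schema of `df`.
  def toTyped(spark: SparkSession, df: DataFrame): Dataset[_] = {
    import spark.implicits._
    val first = df.schema.fields(0)
    // Rename the first column to `a` so it lines up with the case class field.
    val renamed = df.select(df.col(first.name).as("a"))
    first.dataType match {
      case LongType   => renamed.as[LongRecord]
      case StringType => renamed.as[StringRecord]
      case other =>
        throw new IllegalArgumentException(
          s"Unsupported type for column ${first.name}: $other")
    }
  }
}
```

The static result type is only Dataset[_], so code that needs the concrete element type still has to branch on it; the benefit is that each branch works with a regular Encoder instead of Kryo.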
