Defining DataFrame Schema for a table with 1500 columns in Spark


Problem Description

I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.

What is the best way to define the schema for this type of table with more than 1500 columns? Is there any option other than hard-coding the column names along with their datatypes?

  1. Using a case class
  2. Using StructType

The Spark version used is 1.4.
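
For reference, the surrounding read/write pipeline in Spark 1.4 might look like the following sketch. The connection URLs, credentials, and table names are placeholders; the open question is how to supply the schema and datatype conversions in the middle step.

```scala
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("SqlServerToOracle"))
val sqlContext = new SQLContext(sc)

// Read the wide table from SQL Server over JDBC (Spark 1.4 API).
val source = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:sqlserver://host:1433;databaseName=mydb", // placeholder
  "dbtable" -> "dbo.wide_table",                               // placeholder
  "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
)).load()

// ... datatype conversions go here (the subject of this question) ...

// Write the converted records to Oracle.
val props = new Properties()
props.setProperty("user", "scott")      // placeholder
props.setProperty("password", "tiger")  // placeholder
props.setProperty("driver", "oracle.jdbc.OracleDriver")
source.write.jdbc("jdbc:oracle:thin:@host:1521:orcl", "WIDE_TABLE", props)
```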

Recommended Answer

For this type of requirement, I'd suggest the case class approach to prepare the DataFrame.

Yes, there are some limitations around product arity (Scala versions < 2.11 cap case classes at 22 fields), but we can overcome them. For versions < 2.11, you can do it as in the example below.

Prepare a case class which extends Product and overrides the following methods:

  • productArity(): Int: This returns the size of the attributes. In our case, it's 33, so our implementation returns that (see the combined sketch after this list).

  • productElement(n: Int): Any: Given an index, this returns the attribute. As protection, we also have a default case, which throws an IndexOutOfBoundsException.

  • canEqual(that: Any): Boolean: This is the last of the three functions; it serves as a boundary condition when an equality check is being done against the class.

  • Example implementation: you can refer to this Student case class, which has 33 fields in it.
  • Example student dataset description: here.
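
A condensed sketch of this pattern follows (my illustration, not the referenced Student code). A plain class extending Product is used here, because in Scala < 2.11 a case class with more than 22 fields will not compile; only three fields are shown, where the referenced example has 33.

```scala
// A plain class extending Product: pre-2.11, the 22-field case class
// limit forces writing these members by hand for wide rows.
// Three fields shown for brevity; the referenced Student example has 33.
class Student(val id: Int, val name: String, val score: Double)
  extends Product with Serializable {

  // productArity: the number of attributes (33 in the referenced example).
  override def productArity: Int = 3

  // productElement: return the attribute at the given index, with a
  // default case that throws IndexOutOfBoundsException as a guard.
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // canEqual: boundary condition for equality checks against the class.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Student]
}
```

An RDD of such objects should then be convertible to a DataFrame with sqlContext.createDataFrame(rdd), since Spark infers the schema from the Product's constructor parameters.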

Alternatively, use StructType to define the schema and create the DataFrame (if you don't want to use the Spark CSV API), as in the sketch below.
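
A minimal StructType sketch, reusing sc and sqlContext from the earlier snippet; the field names and types are illustrative. To avoid hard-coding ~1500 entries, the (name, type) pairs could be generated programmatically, e.g. from the source table's metadata.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build the schema from (name, type) pairs; for ~1500 columns this list
// could be generated from the source table's metadata instead of typed in.
val columns: Seq[(String, DataType)] =
  Seq(("id", IntegerType), ("name", StringType), ("score", DoubleType))

val schema = StructType(columns.map { case (name, dt) =>
  StructField(name, dt, nullable = true)
})

// Create a DataFrame from an RDD[Row] plus the explicit schema.
val rdd = sc.parallelize(Seq(Row(1, "alice", 95.5)))
val df = sqlContext.createDataFrame(rdd, schema)
```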
