Defining DataFrame Schema for a table with 1500 columns in Spark
Problem Description
I have a table with around 1500 columns in SQL Server. I need to read the data from this table, convert it to the proper datatype format, and then insert the records into an Oracle DB.
What is the best way to define the schema for a table with more than 1500 columns? Is there any option other than hard-coding the column names along with their datatypes?
- Using a Case class
- Using a StructType
The Spark version used is 1.4.
Recommended Answer
For this type of requirement, I'd suggest the case class approach to prepare the DataFrame.
Yes, there are some limitations, such as product arity (a case class is limited to 22 fields in Scala versions below 2.11), but we can overcome them. For Scala versions < 2.11, you can do it like the example below:
Prepare a class which extends Product and overrides the following methods:
- productArity(): Int: returns the number of attributes. In our case, it's 33.
- productElement(n: Int): Any: given an index, returns the attribute at that position. As protection, we also have a default case, which throws an IndexOutOfBoundsException.
- canEqual(that: Any): Boolean: the last of the three methods; it serves as a boundary condition when an equality check is done against the class.
- Example implementation: you can refer to this Student case class, which has 33 fields in it.
- Example student dataset description here.
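The three overrides described above can be sketched as follows. This is a minimal illustration, not the original 33-field Student class; the class and field names here are hypothetical stand-ins:

```scala
// Minimal sketch of a class extending Product, standing in for the
// 33-field Student case class (class and field names are hypothetical).
class Record(val id: Int, val name: String, val score: Double)
  extends Product with Serializable {

  // productArity(): the number of attributes (33 for the Student class).
  override def productArity: Int = 3

  // productElement(n): returns the attribute at the given index; the
  // default case throws an IndexOutOfBoundsException as a guard.
  override def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }

  // canEqual: boundary condition for equality checks against the class.
  override def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
}
```

With such a class, an RDD of Record instances can be turned into a DataFrame in much the same way as with an ordinary case class, which is how the 22-field limit of pre-2.11 case classes can be sidestepped.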
Use StructType to define the schema and create the DataFrame (if you don't want to use the spark-csv API).
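One way to avoid hand-writing 1500 StructFields is to build the StructType programmatically from (column name, type) pairs. A minimal sketch, assuming the Spark 1.4 SQLContext API; the column names and types below are made up, and in practice the pairs could be derived from the source table's metadata (e.g. INFORMATION_SCHEMA.COLUMNS in SQL Server):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("schema-sketch"))
val sqlContext = new SQLContext(sc)

// Hypothetical (name, type) pairs; for the real 1500-column table these
// could be generated from the SQL Server catalog instead of hard-coded.
val columns: Seq[(String, DataType)] = Seq(
  ("id",     IntegerType),
  ("name",   StringType),
  ("amount", DoubleType)
)

// Build the schema from the pairs rather than writing each StructField.
val schema = StructType(columns.map { case (n, t) =>
  StructField(n, t, nullable = true)
})

// Create the DataFrame with the explicit schema.
val rdd = sc.parallelize(Seq(Row(1, "a", 10.5), Row(2, "b", 20.0)))
val df = sqlContext.createDataFrame(rdd, schema)
```

Because the schema is just data, the same map over metadata rows scales to 1500 columns without any per-column hard-coding.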