Spark cast column to sql type stored in string
Question
The simple request is: I need help adding a column to a dataframe, but the column has to be empty, its type must come from ...spark.sql.types, and the type has to be defined from a string.
I can probably do this with ifs or a case, but I'm looking for something more elegant, something that does not require writing a case for every type in org.apache.spark.sql.types.
If I do this for example:
df = df.withColumn("col_name", lit(null).cast(org.apache.spark.sql.types.StringType))
It works as intended, but I have the type stored as a string:
var the_type = "StringType"
or var the_type = "org.apache.spark.sql.types.StringType"
and I can't get it to work by defining the type from the string.
For those interested here are some more details: I have a set containing tuples (col_name, col_type) both as strings and I need to add columns with the correct types for a future union between 2 dataframes.
I currently have this:
for (i <- set_of_col_type_tuples) yield {
  val the_type = Class.forName("org.apache.spark.sql.types." + i._2)
  df = df.withColumn(i._1, lit(null).cast(the_type))
  df
}
if I use
val the_type = Class.forName("org.apache.spark.sql.types."+i._2)
I get
error: overloaded method value cast with alternatives: (to: String)org.apache.spark.sql.Column <and> (to: org.apache.spark.sql.types.DataType)org.apache.spark.sql.Column cannot be applied to (Class[?0])
if I use
val the_type = Class.forName("org.apache.spark.sql.types."+i._2).getName()
It's a string, so I get:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting {<EOF>, '('}(line 1, pos 3)
== SQL ==
org.apache.spark.sql.types.StringType
---^^^
EDIT: So, just to be clear, the set contains tuples like this ("col1","IntegerType"), ("col2","StringType") not ("col1","int"), ("col2","string"). A simple cast(i._2) does not work.
Thank you.
You can use the overloaded method cast, which takes a String as an argument:
val stringType : String = ...
column.cast(stringType)
def cast(to: String): Column
Casts the column to a different data type, using the canonical string representation of the type.
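Note that this canonical string representation uses the short SQL type names ("int", "string", "double", ...), not the Scala class names like "IntegerType". A minimal sketch to illustrate the difference (the dataframe and column names here are made up for the demo):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Throwaway local session, just for the demo
val spark = SparkSession.builder().master("local[1]").appName("cast-demo").getOrCreate()
import spark.implicits._

val df = Seq((1, "a")).toDF("id", "name")

// cast accepts the canonical SQL string for the type...
val withNull = df.withColumn("new_col", lit(null).cast("string"))
// ...but the Scala name "StringType" is not a valid SQL type string,
// so lit(null).cast("StringType") fails to parse

withNull.printSchema()
```

This is exactly why the question's set of ("col1","IntegerType")-style tuples cannot be fed to cast directly.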
You can also scan all the data types:
val types = classOf[DataTypes]
  .getDeclaredFields()
  .filter(f => java.lang.reflect.Modifier.isStatic(f.getModifiers()))
  .map(f => f.get(null).asInstanceOf[DataType]) // static fields: the instance argument is ignored
Now types is an Array[DataType]. You can turn it into a Map:
val typeMap = types.map(t => (t.getClass.getSimpleName.replace("$", ""), t)).toMap
and use in code:
column.cast(typeMap(yourType))
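Putting the pieces together, here is a self-contained sketch of the whole approach; the dataframe and the set of tuples are placeholders standing in for the question's actual data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{DataType, DataTypes}

val spark = SparkSession.builder().master("local[1]").appName("union-prep").getOrCreate()
import spark.implicits._

// Build the map "IntegerType" -> IntegerType, "StringType" -> StringType, ...
// by reflecting over the static fields of DataTypes
val typeMap: Map[String, DataType] = classOf[DataTypes]
  .getDeclaredFields()
  .filter(f => java.lang.reflect.Modifier.isStatic(f.getModifiers()))
  .map(f => f.get(null).asInstanceOf[DataType])
  .map(t => t.getClass.getSimpleName.replace("$", "") -> t)
  .toMap

// Placeholder for the question's set of (col_name, col_type) tuples
val set_of_col_type_tuples = Set(("col1", "IntegerType"), ("col2", "StringType"))

var df = Seq((1, "a")).toDF("id", "name")
for ((colName, colType) <- set_of_col_type_tuples) {
  df = df.withColumn(colName, lit(null).cast(typeMap(colType)))
}
df.printSchema()
```

Each new column is null-valued but carries the correct DataType, which is what the later union between the two dataframes needs.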