Spark: cast decimal without changing nullable property of column

Question

Casting a column to a DecimalType in a DataFrame seems to change the nullable property. Specifically, I have a non-nullable column of type DecimalType(12, 4) and I'm casting it to DecimalType(38, 9) using df.withColumn(columnName, df.col(columnName).cast(dataType)). This results in a field with the expected data type, but the field is now nullable. Is there a way to cast without changing the nullable property of a column?

I observe this behavior in both Spark 2.2.1 and Spark 2.3.0.
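
A minimal reproduction sketch (the column name, local session settings, and sample value are illustrative, not part of the original question):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("cast-nullability").getOrCreate()

// Build a DataFrame with an explicitly non-nullable DecimalType(12, 4) column.
val schema = StructType(Seq(StructField("amount", DecimalType(12, 4), nullable = false)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(BigDecimal("1.2345")))),
  schema)

df.printSchema()
// |-- amount: decimal(12,4) (nullable = false)

// Cast to a wider decimal: the data type changes as expected, but the column becomes nullable.
val casted = df.withColumn("amount", df.col("amount").cast(DecimalType(38, 9)))
casted.printSchema()
// |-- amount: decimal(38,9) (nullable = true)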

Answer

Thanks for an interesting point. I dug a little into the source code to understand this behavior, and IMO the answer is in Cast.scala, which represents the cast expression. The property exposing nullability is computed like this:

override def nullable: Boolean = Cast.forceNullable(child.dataType, dataType) || child.nullable

def forceNullable(from: DataType, to: DataType): Boolean = (from, to) match {
  case (NullType, _) => true
  case (_, _) if from == to => false

  case (StringType, BinaryType) => false
  case (StringType, _) => true
  case (_, StringType) => false

  case (FloatType | DoubleType, TimestampType) => true
  case (TimestampType, DateType) => false
  case (_, DateType) => true
  case (DateType, TimestampType) => false
  case (DateType, _) => true
  case (_, CalendarIntervalType) => true

  case (_, _: DecimalType) => true  // overflow
  case (_: FractionalType, _: IntegralType) => true  // NaN, infinity
  case _ => false
}
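
For instance, here is an illustrative sketch (not part of the original answer, reusing the df from the reproduction above): a cast to the exact same DecimalType falls into the from == to case and keeps the column non-nullable, while widening to a different precision/scale falls into the (_, _: DecimalType) case and forces nullable.

// No-op cast: matches `case (_, _) if from == to => false`, so nullability is preserved.
df.withColumn("amount", df.col("amount").cast(DecimalType(12, 4))).printSchema()
// |-- amount: decimal(12,4) (nullable = false)

// Widening cast: matches `case (_, _: DecimalType) => true`, so the column is forced nullable.
df.withColumn("amount", df.col("amount").cast(DecimalType(38, 9))).printSchema()
// |-- amount: decimal(38,9) (nullable = true)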

As you can see, the conversion from any type to DecimalType always returns a nullable type. I was wondering why, and it's probably because of the risk of overflow, which is handled here:

/**
 * Change the precision / scale in a given decimal to those set in `decimalType` (if any),
 * returning null if it overflows or modifying `value` in-place and returning it if successful.
 *
 * NOTE: this modifies `value` in-place, so don't call it on external data.
 */
private[this] def changePrecision(value: Decimal, decimalType: DecimalType): Decimal = {
  if (value.changePrecision(decimalType.precision, decimalType.scale)) value else null
}
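
To see the overflow case in action, here is an illustrative sketch (not from the original answer, reusing the session and imports from the question's sketch): under the default, non-ANSI behavior of Spark 2.x, a decimal value that does not fit the target precision/scale is turned into null, which is exactly what the forced nullability accounts for.

import org.apache.spark.sql.functions.lit

// 12345.6789 needs 5 integer digits, but decimal(5, 2) only allows 3, so changePrecision
// fails and the cast produces null instead of a value.
spark.range(1)
  .select(lit(BigDecimal("12345.6789")).cast(DecimalType(5, 2)).as("overflowed"))
  .show()
// +----------+
// |overflowed|
// +----------+
// |      null|
// +----------+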

The changePrecision method in turn checks whether the precision can be modified, returning true if it can and false otherwise. That explains why the method above can return null, and hence why a cast to DecimalType is marked nullable by default, independently of the source type.

Because of that, IMO there is no simple way to keep the nullability of the original column. Maybe you could take a look at UserDefinedTypes and build your own DecimalType that keeps the source properties? But IMO the nullability is not there without reason, and we should respect it to avoid some bad surprises sooner or later in the pipeline.
