Spark: cast decimal without changing nullable property of column

Question

Casting a column to a DecimalType in a DataFrame seems to change the nullable property. Specifically, I have a non-nullable column of type DecimalType(12, 4) and I'm casting it to DecimalType(38, 9) using df.withColumn(columnName, df.col(columnName).cast(dataType)). This results in a field with the expected data type, but the field is now nullable. Is there a way to cast without changing the nullable property of a column?

I observe this behavior in both Spark 2.2.1 and Spark 2.3.0.
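
Here is a minimal sketch that reproduces what I'm seeing (the amount column name and the sample value are just illustrative):

import java.math.BigDecimal
import java.util.Arrays

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A schema with an explicitly non-nullable decimal column.
val schema = StructType(Seq(StructField("amount", DecimalType(12, 4), nullable = false)))
val df = spark.createDataFrame(Arrays.asList(Row(new BigDecimal("1234.5678"))), schema)

println(df.schema("amount").nullable)  // false

// The cast widens the type as expected, but the field becomes nullable.
val casted = df.withColumn("amount", df.col("amount").cast(DecimalType(38, 9)))
println(casted.schema("amount").nullable)  // true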

Answer

Thanks for an interesting point. I dug a little into the source code to understand this behavior, and IMO the answer is in Cast.scala, which represents the cast expression. The property exposing nullability is computed like this:

override def nullable: Boolean = Cast.forceNullable(child.dataType, dataType) || child.nullable

def forceNullable(from: DataType, to: DataType): Boolean = (from, to) match {
  case (NullType, _) => true
  case (_, _) if from == to => false

  case (StringType, BinaryType) => false
  case (StringType, _) => true
  case (_, StringType) => false

  case (FloatType | DoubleType, TimestampType) => true
  case (TimestampType, DateType) => false
  case (_, DateType) => true
  case (DateType, TimestampType) => false
  case (DateType, _) => true
  case (_, CalendarIntervalType) => true

  case (_, _: DecimalType) => true  // overflow
  case (_: FractionalType, _: IntegralType) => true  // NaN, infinity
  case _ => false
}

As you can see, the conversion from any type to DecimalType always returns a nullable type. I was wondering why, and it's probably because of the risk of overflow that is expressed here:

/**
 * Change the precision / scale in a given decimal to those set in `decimalType` (if any),
 * returning null if it overflows or modifying `value` in-place and returning it if successful.
 *
 * NOTE: this modifies `value` in-place, so don't call it on external data.
 */
private[this] def changePrecision(value: Decimal, decimalType: DecimalType): Decimal = {
  if (value.changePrecision(decimalType.precision, decimalType.scale)) value else null
}

The changePrecision method in turn checks whether the precision can be modified, returning true if it can and false otherwise. That explains why the method above can return null, and hence why a column cast to DecimalType is set to nullable by default, independently of the source type.
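
To see that overflow in action, here is a short sketch (reusing the hypothetical df from the question above): shrinking the decimal so the value no longer fits produces null rather than an error under Spark 2.x's default behavior, which is exactly why the result field must be nullable:

import org.apache.spark.sql.functions.col

// 1234.5678 cannot fit DecimalType(5, 4) (max 9.9999), so the cast
// overflows and Spark 2.x returns null instead of failing.
df.select(col("amount").cast(DecimalType(5, 4)).as("narrow")).show()
// +------+
// |narrow|
// +------+
// |  null|
// +------+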

Because of that, IMO there is no simple way to keep the nullability of the original column. Maybe you could take a look at UserDefinedTypes and build your own DecimalType that keeps the source properties? But IMO the nullability is there not without reason, and we should respect it to avoid some bad surprises sooner or later in the pipeline.
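
That said, if you are certain a given cast can never overflow, one workaround people sometimes use (not part of this answer's recommendation, so treat it as an unverified sketch reusing the casted DataFrame from the question) is to rebuild the DataFrame with a schema that re-declares the column as non-nullable:

// Hypothetical workaround: reassert non-nullability after the cast.
// Spark does not re-validate the data here, so only do this when the cast
// provably cannot overflow; a null sneaking into a "non-nullable" field
// will cause hard-to-debug failures later in the pipeline.
val fixedSchema = StructType(casted.schema.map {
  case f if f.name == "amount" => f.copy(nullable = false)
  case f => f
})
val nonNullable = spark.createDataFrame(casted.rdd, fixedSchema)
println(nonNullable.schema("amount").nullable)  // false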
