size function applied to empty array column in DataFrame returns 1 after split


Question

I noticed the following when applying the size function to an array column in a DataFrame, where the array is produced by a split:

import org.apache.spark.sql.functions.{trim, explode, split, size}

val df1 = Seq(
  (1, "[{a},{b},{c}]"),
  (2, "[]"),
  (3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)

val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)

We get:

+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |1  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+

I was hoping for a zero for the second row, so as to be able to distinguish between 0 and 1 entries.

There are a few hints here and there on SO, but none that helped.

If I have the following entry instead: (2, null), then I get size -1, which is more helpful, I guess.

On the other hand, this sample borrowed from the internet:

val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()

returns 0 for the null row, as expected.

I am looking for the correct approach here so as to get size = 0.

Recommended answer

I suppose the root cause is that split, applied to an empty string, returns an array containing a single empty string rather than an empty array or a null.

scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""

And the size of an array containing an empty string is, of course, 1.
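
The same effect can be reproduced without Spark. The following is a minimal sketch in plain Scala, assuming only the standard library; the object name `SplitDemo` and the helper `splitOrEmpty` are hypothetical names introduced for illustration and are not part of any Spark API.

```scala
// Splitting an empty string yields one empty token, not zero tokens.
object SplitDemo {
  // Hypothetical helper: return no tokens at all for an empty input.
  def splitOrEmpty(s: String): Array[String] =
    if (s.isEmpty) Array.empty[String] else s.split(",")

  def main(args: Array[String]): Unit = {
    // Strip the surrounding brackets, leaving an empty string.
    val stripped = "[]".stripPrefix("[").stripSuffix("]")

    println(stripped.split(",").length)         // 1
    println(splitOrEmpty(stripped).length)      // 0
    println(splitOrEmpty("{a},{b},{c}").length) // 3
  }
}
```

This only illustrates why a count of 1 appears; inside a DataFrame the guard has to be expressed with column functions instead.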

To get around this, perhaps you could do

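import org.apache.spark.sql.functions.{length, lit, when}

// If the first element is the empty string, report 0 instead of the array size.
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
             .withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
                              .otherwise(lit(0)))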

df2.show(false)
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |0  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
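
As an alternative, the empty-string tokens could be dropped before counting. This is a sketch rather than part of the original answer: it assumes Spark 2.4 or later (for `array_remove`), and the object name `SizeOfSplit`, the helper `withTokenCount`, and the local `SparkSession` setup are hypothetical, added only to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_remove, col, size, split, trim}

object SizeOfSplit {
  // Hypothetical helper: split col2 on commas, drop empty-string tokens, then count.
  def withTokenCount(df: DataFrame): DataFrame =
    df.withColumn("cola", array_remove(split(trim(col("col2"), "[]"), ","), ""))
      .withColumn("s", size(col("cola")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("size-of-split").getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      (1, "[{a},{b},{c}]"),
      (2, "[]"),
      (3, "[{d},{e},{f}]")
    ).toDF("col1", "col2")

    withTokenCount(df1).show(false)
    spark.stop()
  }
}
```

Unlike the `when` / `otherwise` version, this also discards empty tokens in the middle of a list (for example, from two consecutive commas), and the `cola` column itself becomes a truly empty array for the `[]` row. Whether that is desirable depends on the data.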
