size function applied to empty array column in dataframe returns 1 after split
Problem description
I noticed the following when using the size function on an array column in a dataframe. The code below includes a split:
import org.apache.spark.sql.functions.{trim, explode, split, size}
val df1 = Seq(
(1, "[{a},{b},{c}]"),
(2, "[]"),
(3, "[{d},{e},{f}]")
).toDF("col1", "col2")
df1.show(false)
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", size($"cola"))
df2.show(false)
We get:
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |1  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
I was hoping for a zero, so as to be able to distinguish between 0 and 1 entries.
I found a few hints here and there on SO, but none that helped.
If I have the following entry instead: (2, null), then I get size -1, which is more helpful, I guess.
On the other hand, this sample borrowed from the internet:
import org.apache.spark.sql.functions.{array, coalesce, size}
val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
val df3 = df2.withColumn("s", size($"numbers"))
df3.show()
returns 0, as expected.
I am looking for the correct approach here so as to get size = 0.
Recommended answer
I suppose the root cause is that split, applied to an empty string, returns an array containing a single empty string, rather than an empty array or a null.
scala> df1.withColumn("cola", split(trim($"col2", "[]"), ",")).withColumn("s", $"cola"(0)).select("s").collect()(1)(0)
res53: Any = ""
And the size of an array containing an empty string is, of course, 1.
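The same effect can be seen without Spark. The following is a minimal sketch in plain Scala, assuming that Spark's split behaves like Java's String.split with a limit of -1 for this input (which is consistent with the res53 output above). The object and method names are made up for illustration, and stripPrefix/stripSuffix only approximate what trim($"col2", "[]") does.

```scala
// Plain Scala, no Spark required. SplitSizeDemo, naiveCount and safeCount
// are hypothetical names used only for this sketch.
object SplitSizeDemo {
  // Mimics the question's pipeline: strip the brackets, then split on commas.
  // Splitting an empty string still yields one element: the empty string.
  def naiveCount(s: String): Int =
    s.stripPrefix("[").stripSuffix("]").split(",", -1).length

  // Same idea, but an empty remainder is treated as zero entries.
  def safeCount(s: String): Int = {
    val inner = s.stripPrefix("[").stripSuffix("]")
    if (inner.isEmpty) 0 else inner.split(",", -1).length
  }

  def main(args: Array[String]): Unit = {
    println(naiveCount("[]"))            // 1
    println(safeCount("[]"))             // 0
    println(safeCount("[{a},{b},{c}]"))  // 3
  }
}
```

So the 1 is not a quirk of size; it comes from what the split step hands over.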
To get around this, perhaps you could do
import org.apache.spark.sql.functions.{length, lit, when}
val df2 = df1.withColumn("cola", split(trim($"col2", "[]"), ","))
.withColumn("s", when(length($"cola"(0)) =!= 0, size($"cola"))
.otherwise(lit(0)))
df2.show(false)
+----+-------------+---------------+---+
|col1|col2         |cola           |s  |
+----+-------------+---------------+---+
|1   |[{a},{b},{c}]|[{a}, {b}, {c}]|3  |
|2   |[]           |[]             |0  |
|3   |[{d},{e},{f}]|[{d}, {e}, {f}]|3  |
+----+-------------+---------------+---+
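Another option, not part of the original answer, might be to drop the empty strings from the array before measuring it. This is only a sketch, assuming Spark 2.4 or later (where array_remove is available) and a local SparkSession. Note that it also removes empty entries in the middle of a list, for example in "[{a},,{b}]", which may or may not be what you want.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_remove, size, split, trim}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("size-after-split")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, "[{a},{b},{c}]"),
  (2, "[]"),
  (3, "[{d},{e},{f}]")
).toDF("col1", "col2")

// array_remove drops every "" element, so the one-element array produced
// by splitting an empty string becomes a genuinely empty array.
val df2 = df1
  .withColumn("cola", array_remove(split(trim($"col2", "[]"), ","), ""))
  .withColumn("s", size($"cola"))

df2.show(false)
```

With this variant cola itself is empty for row 2, so size should report 0 there without a separate when/otherwise branch.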