Selecting empty array values from a Spark DataFrame
Question
Given a DataFrame with the following rows:

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]
I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row).
For example, I might expect this code to work:
df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()
I have two problems:

- how to combine where clauses with and, but more importantly...
- how to determine if the array is empty.
So, is there a builtin function to query for empty arrays? Is there an elegant way to coerce an empty array to an na or null value?

I'm trying to avoid using Python to solve it, either with a UDF or .map().
how to combine where clauses with and
To construct boolean expressions on columns you should use the &, | and ~ operators, so in your case it should be something like this:

~lit(True) & ~lit(False)
Since these operators have higher precedence than the comparison operators, for complex expressions you'll have to use parentheses:
(lit(1) > lit(2)) & (lit(3) > lit(4))
how to determine if the array is empty.
I am pretty sure there is no elegant way to handle this without a UDF. I guess you already know you can use a Python UDF like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

isEmpty = udf(lambda x: len(x) == 0, BooleanType())
It is also possible to use a Hive UDF:
df.registerTempTable("df")
query = "SELECT * FROM df WHERE {0}".format(
    " AND ".join("SIZE({0}) > 0".format(c) for c in ["col2", "col3", "col4"]))
sqlContext.sql(query)
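The predicate construction here is ordinary Python string handling, so it can be inspected without a Spark session at all:

```python
# Build the Hive SQL predicate from the column names; this is plain
# Python string manipulation and easy to verify on its own.
columns = ["col2", "col3", "col4"]
predicate = " AND ".join("SIZE({0}) > 0".format(c) for c in columns)
query = "SELECT * FROM df WHERE {0}".format(predicate)
print(query)
# SELECT * FROM df WHERE SIZE(col2) > 0 AND SIZE(col3) > 0 AND SIZE(col4) > 0
```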
The only feasible non-UDF solution that comes to mind is to cast to string:

from functools import reduce
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

cols = [
    col(c).cast(StringType()) != lit("ArrayBuffer()")
    for c in ["col2", "col3", "col4"]
]
cond = reduce(lambda x, y: x & y, cols)
df.where(cond)
but it smells from a mile away.
It is also possible to explode an array, groupBy, agg using count, and join, but that is most likely far too expensive to be useful in any real-life scenario.
Probably the best approach to avoid UDFs and dirty hacks is to replace empty arrays with NULL.