Selecting empty array values from a Spark DataFrame
Question
Given a DataFrame with the following rows:
from pyspark.sql import Row

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]
I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row).
For example I might expect this code to work:
df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()
I have two questions:
- how to combine where clauses with and, but more importantly...
- how to determine if the array is empty.
So, is there a builtin function to query for empty arrays? Is there an elegant way to coerce an empty array to an na or null value?
I'm trying to avoid using Python to solve it, either with a UDF or .map().
Answer
how to combine where clauses with and
To construct boolean expressions on columns you should use the &, | and ~ operators, so in your case it should be something like this:
~lit(True) & ~lit(False)
Since these operators have higher precedence than the comparison operators, for complex expressions you'll have to use parentheses:
(lit(1) > lit(2)) & (lit(3) > lit(4))
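The precedence point can be checked in plain Python, since & binds tighter than comparisons there too (the values below are arbitrary, chosen so the two groupings give different results):

```python
# & binds tighter than > in Python, so a > b & c > d parses as the
# chained comparison a > (b & c) > d, not (a > b) & (c > d).
a, b, c, d = 5, 1, 2, 0

wrong = a > b & c > d        # 5 > (1 & 2) > 0  ->  5 > 0 > 0  ->  False
right = (a > b) & (c > d)    # True & True      ->  True

print(wrong, right)
```

On Spark Columns the mis-grouped form typically fails outright rather than silently, which is why the parenthesized style is the safe habit.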
how to determine if the array is empty
I am pretty sure there is no elegant way to handle this without a UDF. I guess you already know you can use a Python UDF like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

isEmpty = udf(lambda x: len(x) == 0, BooleanType())
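The wrapped predicate is plain Python, so it can be sanity-checked without a SparkSession before registering it as a UDF:

```python
# The same predicate the UDF wraps, applied to plain Python lists.
predicate = lambda x: len(x) == 0

print(predicate([]))      # True
print(predicate([1, 2]))  # False
```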
It is also possible to use a Hive UDF:
df.registerTempTable("df")
query = "SELECT * FROM df WHERE {0}".format(
" AND ".join("SIZE({0}) > 0".format(c) for c in ["col2", "col3", "col4"]))
sqlContext.sql(query)
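The query string itself is built with plain string formatting, so it can be inspected before handing it to sqlContext (column names as in the example above):

```python
# Build the same WHERE clause as above and print the result.
cols = ["col2", "col3", "col4"]
query = "SELECT * FROM df WHERE {0}".format(
    " AND ".join("SIZE({0}) > 0".format(c) for c in cols))

print(query)
# SELECT * FROM df WHERE SIZE(col2) > 0 AND SIZE(col3) > 0 AND SIZE(col4) > 0
```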
The only feasible non-UDF solution that comes to mind is to cast to string:
from functools import reduce

from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

cols = [
    col(c).cast(StringType()) != lit("ArrayBuffer()")
    for c in ["col2", "col3", "col4"]
]
cond = reduce(lambda x, y: x & y, cols)
df.where(cond)
but it smells from a mile away.
It is also possible to explode an array, groupBy, agg using count and join, but it is most likely far too expensive to be useful in any real-life scenario.
Probably the best approach to avoid UDFs and dirty hacks is to replace empty arrays with NULL.
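A pure-Python sketch of that idea on the example rows (no Spark needed; in PySpark itself, assuming a version whose pyspark.sql.functions ships size and when, the equivalent would be when(size(col(c)) > 0, col(c)) per column followed by dropna(how="all", subset=...)):

```python
# Simulate "empty array -> NULL, then drop rows where every array
# column is NULL", using dicts in place of Rows.
rows = [
    {"col1": "abc", "col2": [8], "col3": [18], "col4": [16]},
    {"col1": "def", "col2": [18], "col3": [18], "col4": []},
    {"col1": "ghi", "col2": [], "col3": [], "col4": []},
]
arr_cols = ["col2", "col3", "col4"]

# Step 1: coerce each empty array to None
# (Spark equivalent: when(size(col(c)) > 0, col(c))).
with_nulls = [
    {**r, **{c: (r[c] if r[c] else None) for c in arr_cols}} for r in rows
]

# Step 2: keep rows where at least one array column is non-NULL
# (Spark equivalent: dropna(how="all", subset=arr_cols)).
kept = [r for r in with_nulls if any(r[c] is not None for c in arr_cols)]

print([r["col1"] for r in kept])  # ['abc', 'def']
```

Note that the 'def' row survives with col4 as NULL, while the all-empty 'ghi' row is dropped, matching what the question asks for.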