Selecting empty array values from a Spark DataFrame


Question

Given a DataFrame with the following rows:

from pyspark.sql import Row

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]

I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row).

For example, I might expect this code to work:

df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()

I have two questions:

  1. how to combine where clauses with and but more importantly...
  2. how to determine if the array is empty.

So, is there a built-in function to query for empty arrays? Is there an elegant way to coerce an empty array to an na or null value?

I'm trying to avoid using Python to solve it, either with a UDF or .map().

Answer

how to combine where clauses with and

To construct boolean expressions on columns you should use the &, | and ~ operators, so in your case it should be something like this:

~lit(True) & ~lit(False)

Since these operators have higher precedence than the comparison operators, for complex expressions you'll have to use parentheses:

(lit(1) > lit(2)) & (lit(3) > lit(4))

how to determine if the array is empty.

I am pretty sure there is no elegant way to handle this without a UDF. I guess you already know you can use a Python UDF like this:

isEmpty = udf(lambda x: len(x) == 0, BooleanType())
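
A minimal, self-contained sketch of how such a UDF could be applied to the three columns, combining the negated predicates with & as described above (it assumes none of the array columns is null):

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# flags empty arrays; len() would raise on null values, so nulls are assumed absent
isEmpty = udf(lambda x: len(x) == 0, BooleanType())

# keep only rows where none of the three array columns is empty
df.where(~isEmpty(df.col2) & ~isEmpty(df.col3) & ~isEmpty(df.col4)).collect()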

It is also possible to use a Hive UDF:

df.registerTempTable("df")

# SIZE(...) is the Hive function returning the number of elements in an array
query = "SELECT * FROM df WHERE {0}".format(
  " AND ".join("SIZE({0}) > 0".format(c) for c in ["col2", "col3", "col4"]))

sqlContext.sql(query)
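
Since one SIZE predicate is generated per column and the pieces are joined with AND, the query that actually runs expands to:

SELECT * FROM df WHERE SIZE(col2) > 0 AND SIZE(col3) > 0 AND SIZE(col4) > 0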

The only feasible non-UDF solution that comes to mind is to cast to string:

from functools import reduce
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

# in this Spark version an empty array casts to the string "ArrayBuffer()"
cols = [
    col(c).cast(StringType()) != lit("ArrayBuffer()")
    for c in ["col2", "col3", "col4"]
]
cond = reduce(lambda x, y: x & y, cols)
df.where(cond)

but it smells from a mile away.

It is also possible to explode the arrays, groupBy, agg using count and join, but it is most likely far too expensive to be useful in any real-life scenario.
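
For illustration, a rough sketch of what that approach would look like, assuming col1 uniquely identifies a row (exploding an empty array produces no output rows, so rows with an empty array simply drop out of the inner join):

from pyspark.sql.functions import col, count, explode

result = df
for c in ["col2", "col3", "col4"]:
    # count the elements of column c per key; keys with empty arrays never appear
    counts = (df.select("col1", explode(c).alias("elem"))
                .groupBy("col1")
                .agg(count("elem").alias("n")))
    # the inner join keeps only keys that have at least one element in column c
    result = result.join(counts, "col1").where(col("n") > 0).drop("n")

result.collect()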

Probably the best approach to avoid UDFs and dirty hacks is to replace empty arrays with NULL.
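
For example, a minimal sketch assuming the empty arrays can be normalized to None when the rows are built (None becomes a SQL NULL, so plain column predicates work):

from pyspark.sql import Row

# empty arrays replaced with None at construction time
rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=None),
    Row(col1='ghi', col2=None, col3=None, col4=None)]

df = sqlContext.createDataFrame(rows)

# NULL-aware filtering with ordinary column expressions
df.where(
    df.col2.isNotNull() & df.col3.isNotNull() & df.col4.isNotNull()
).collect()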
