Selecting empty array values from a Spark DataFrame


Problem Description


Given a DataFrame with the following rows:

rows = [
    Row(col1='abc', col2=[8], col3=[18], col4=[16]),
    Row(col1='def', col2=[18], col3=[18], col4=[]),
    Row(col1='ghi', col2=[], col3=[], col4=[])]

I'd like to remove rows with an empty array for each of col2, col3 and col4 (i.e. the 3rd row).

For example I might expect this code to work:

df.where(~df.col2.isEmpty(), ~df.col3.isEmpty(), ~df.col4.isEmpty()).collect()

I have two problems:

  1. how to combine where clauses with and, but more importantly...
  2. how to determine if the array is empty.

So, is there a builtin function to query for empty arrays? Is there an elegant way to coerce an empty array to an na or null value?

I'm trying to avoid using Python to solve it, either with a UDF or .map().

Solution

how to combine where clauses with and

To construct boolean expressions on columns you should use the &, | and ~ operators, so in your case it should be something like this:

~lit(True) & ~lit(False)

Since these operators have higher precedence than the comparison operators, for complex expressions you'll have to use parentheses:

(lit(1) > lit(2)) & (lit(3) > lit(4))

how to determine if the array is empty.

I am pretty sure there is no elegant way to handle this without a UDF. I guess you already know you can use a Python UDF like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

isEmpty = udf(lambda x: len(x) == 0, BooleanType())
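
Combined with the & and ~ operators from the first part, the whole filter might then look like this (a sketch, assuming df is built from the rows above):

df.where(~isEmpty(df.col2) & ~isEmpty(df.col3) & ~isEmpty(df.col4)).collect()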

It is also possible to use a Hive UDF:

# Register the DataFrame as a temporary table so it can be queried with SQL,
# then build a WHERE clause from the Hive SIZE() function for each column.
df.registerTempTable("df")
query = "SELECT * FROM df WHERE {0}".format(
  " AND ".join("SIZE({0}) > 0".format(c) for c in ["col2", "col3", "col4"]))

sqlContext.sql(query)
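
For the three columns above, the string this builds is simply:

SELECT * FROM df WHERE SIZE(col2) > 0 AND SIZE(col3) > 0 AND SIZE(col4) > 0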

The only feasible non-UDF solution that comes to mind is to cast to string:

from functools import reduce  # in Python 2, reduce is a builtin
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

# An empty array column renders as the string "ArrayBuffer()" when cast.
cols = [
    col(c).cast(StringType()) != lit("ArrayBuffer()")
    for c in ["col2", "col3", "col4"]
]
cond = reduce(lambda x, y: x & y, cols)
df.where(cond)

but you can smell it from a mile away.

It is also possible to explode the arrays, groupBy and agg using count, and join back, but it is most likely far too expensive to be useful in any real-life scenario.

Probably the best approach to avoid UDFs and dirty hacks is to replace empty arrays with NULL.
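
A minimal sketch of that approach, assuming a Spark version where size() is exposed in pyspark.sql.functions (1.5+), and reusing the column names from the example:

from pyspark.sql.functions import col, size, when

# when() without otherwise() yields NULL for empty arrays;
# na.drop then removes rows where any of the three columns is NULL.
df_clean = df.select(
    "col1",
    *[when(size(col(c)) > 0, col(c)).alias(c) for c in ["col2", "col3", "col4"]]
).na.drop(subset=["col2", "col3", "col4"])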
