Spark DataFrame filter operation


Problem description

I have a Spark DataFrame and a filter string to apply to it. The filter only selects some of the rows, but I would like to know the reason why the remaining rows were not selected. Example:

DataFrame columns: customer_id|col_a|col_b|col_c|col_d

Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0

etc...

reason_for_exclusion can be any string or letter, as long as it explains why a particular row was excluded.

I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so I'm just checking whether there is a better way to do this operation?

Thanks

Answer

You'll have to check each condition within the filter expression, which can be expensive compared to the simple filtering operation itself. I would suggest displaying the same reason for all excluded rows, since each of them fails at least one condition in that expression. It's not pretty, but I'd prefer this approach because it's efficient, especially when you have to handle very large DataFrames.

from pyspark.sql.functions import expr, lit, when

data = [(1, 1, 5, -3, 0), (2, 0, 10, -1, 0), (3, 0, 10, -4, 1)]
df = spark.createDataFrame(data, ["customer_id", "col_a", "col_b", "col_c", "col_d"])

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"

# Rows that fail the filter get the whole expression as the exclusion reason;
# rows that pass get null.
filtered_df = df.withColumn(
    "reason_for_exclusion",
    when(~expr(filter_expr), lit(filter_expr)).otherwise(lit(None))
)
filtered_df.show(truncate=False)

Output:

+-----------+-----+-----+-----+-----+-------------------------------------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion                             |
+-----------+-----+-----+-----+-----+-------------------------------------------------+
|1          |1    |5    |-3   |0    |null                                             |
|2          |0    |10   |-1   |0    |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
|3          |0    |10   |-4   |1    |col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0|
+-----------+-----+-----+-----+-----+-------------------------------------------------+
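With this flag in place you can also split the DataFrame into kept and excluded rows in a single pass. A minimal sketch, assuming the setup above (the kept/excluded names are just illustrative):

from pyspark.sql.functions import col

# Rows that passed the filter have a null reason; excluded rows carry the expression
kept = filtered_df.filter(col("reason_for_exclusion").isNull()).drop("reason_for_exclusion")
excluded = filtered_df.filter(col("reason_for_exclusion").isNotNull())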

Now, if you really want to display only the conditions that failed, you can turn each condition into a separate column and use a DataFrame select to do the calculation. You'll then have to check which columns evaluated to False to know which condition failed.

You could name these columns <PREFIX>_<condition> so that you can identify them easily later. Here is a complete example:

from pyspark import StorageLevel
from pyspark.sql.functions import array, array_except, col, expr, lit, when

filter_expr = "col_a > 0 AND col_b > 4 AND col_c < 0 AND col_d=0"
COLUMN_FILTER_PREFIX = "filter_validation_"
original_columns = [col(c) for c in df.columns]

# Create one boolean column per condition in the filter expression.
# Note: this naive split assumes the conditions are joined by a literal "AND".
condition_columns = [expr(f).alias(COLUMN_FILTER_PREFIX + f) for f in filter_expr.split("AND")]

# Evaluate each condition to True/False and persist the DataFrame so the
# calculated columns are not recomputed for every downstream action.
filtered_df = df.select(original_columns + condition_columns)
filtered_df = filtered_df.persist(StorageLevel.MEMORY_AND_DISK)

# Get back the columns we calculated for the filter.
filter_col_names = [c for c in filtered_df.columns if COLUMN_FILTER_PREFIX in c]
filter_columns = list()
for c in filter_col_names:
    # For each failed condition, emit its text; passing conditions yield null.
    filter_columns.append(
        when(~col(f"`{c}`"),
             lit(f"{c.replace(COLUMN_FILTER_PREFIX, '')}")
             )
    )

# Collect the failed-condition strings and drop the nulls left by passing ones.
array_reason_filter = array_except(array(*filter_columns), array(lit(None)))
df_with_filter_reason = filtered_df.withColumn("reason_for_exclusion", array_reason_filter)

df_with_filter_reason.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)

Output:
+-----------+-----+-----+-----+-----+----------------------+
|customer_id|col_a|col_b|col_c|col_d|reason_for_exclusion  |
+-----------+-----+-----+-----+-----+----------------------+
|1          |1    |5    |-3   |0    |[]                    |
|2          |0    |10   |-1   |0    |[col_a > 0 ]          |
|3          |0    |10   |-4   |1    |[col_a > 0 ,  col_d=0]|
+-----------+-----+-----+-----+-----+----------------------+
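If you prefer a plain string over an array, you can join the failed conditions with array_join; and since the intermediate DataFrame was persisted, it's worth releasing it once you're done. A minimal follow-up sketch, assuming the variables from the example above (the delimiter is a free choice here):

from pyspark.sql.functions import array_join, col

# Join the failed conditions into one readable string (empty for passing rows)
readable = df_with_filter_reason.withColumn(
    "reason_for_exclusion", array_join(col("reason_for_exclusion"), " AND ")
)
readable.select(*original_columns, col("reason_for_exclusion")).show(truncate=False)

# Release the cached intermediate DataFrame when finished
filtered_df.unpersist()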

