How to drop columns based on multiple filters in a dataframe using PySpark?


Problem description


I have a list of valid values that a cell can have. If any cell in a column holds an invalid value, I need to drop the whole column. I understand there are answers about dropping rows in a particular column, but here I want to drop the entire column even if just one cell in it is invalid. The validity condition is that a cell may only hold one of three values: ['Messi', 'Ronaldo', 'Virgil']


I tried reading about filtering, but all I could find was filtering on columns and dropping rows, for instance in this question. I have also read that one should avoid too much scanning and shuffling in Spark, which I agree with.
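For contrast, the row-level filtering those answers cover looks roughly like the sketch below (assuming a DataFrame `df` and the list of valid values); it drops rows, not columns, so it does not solve this problem:

```python
from pyspark.sql.functions import col

valid = ['Messi', 'Ronaldo', 'Virgil']

# Row-level filtering: keeps only the rows where Column 2 holds a valid
# value. This drops ROWS, whereas the goal here is to drop whole COLUMNS.
rows_kept = df.filter(col('Column 2').isin(valid))
```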


I am not just looking for a code solution, but rather for off-the-shelf functionality provided by PySpark. I hope this doesn't go beyond the scope of an SO answer.

For the following input DataFrame:

| Column 1      | Column 2      | Column 3      | Column 4      | Column 5      |
| --------------| --------------| --------------| --------------| --------------|
|  Ronaldo      | Salah         |  Messi        |               |Salah          |
|  Ronaldo      | Messi         |  Virgil       |  Messi        | null          |
|  Ronaldo      | Ronaldo       |  Messi        |  Ronaldo      | null          |
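For reference, a minimal sketch of how this input DataFrame could be constructed, assuming the blank cell is a null (None):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The blank cell in Column 4 is assumed to be null (None), like the
# cells shown as null in Column 5.
df = spark.createDataFrame(
    [
        ('Ronaldo', 'Salah', 'Messi', None, 'Salah'),
        ('Ronaldo', 'Messi', 'Virgil', 'Messi', None),
        ('Ronaldo', 'Ronaldo', 'Messi', 'Ronaldo', None),
    ],
    ['Column 1', 'Column 2', 'Column 3', 'Column 4', 'Column 5'],
)
```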

I expect the following output:

| Column 1      | Column 3      |
| --------------| --------------|
|  Ronaldo      |  Messi        |
|  Ronaldo      |  Virgil       |
|  Ronaldo      |  Messi        |

Answer


> I am not just looking for a code solution, but rather for off-the-shelf functionality provided by PySpark.


Unfortunately, Spark is designed to operate in parallel on a row-by-row basis. Filtering out columns is not something for which there will be an "off-the-shelf code" solution.


Nevertheless, here is one approach you can take:


First collect the counts of the invalid elements in each column.

```python
from pyspark.sql.functions import col, lit, sum as _sum, when

valid = ['Messi', 'Ronaldo', 'Virgil']

# For each column, emit 0 when the cell is a valid value and 1 otherwise,
# then sum per column. For null cells, col(c).isin(valid) evaluates to null,
# so when() falls through to otherwise(lit(1)): nulls count as invalid.
invalid_counts = df.select(
    *[_sum(when(col(c).isin(valid), lit(0)).otherwise(lit(1))).alias(c) for c in df.columns]
).collect()
print(invalid_counts)
#[Row(Column 1=0, Column 2=1, Column 3=0, Column 4=1, Column 5=3)]
```


The output is a list containing a single Row. You can iterate over its items to find the columns to keep.

```python
# Keep only the columns whose invalid count is zero.
valid_columns = [k for k, v in invalid_counts[0].asDict().items() if v == 0]
print(valid_columns)
#['Column 3', 'Column 1']
```


Now just select these columns from your original DataFrame. If you want to maintain the original column order, you can first sort `valid_columns` using `list.index`.

```python
# Sort by each column's position in df.columns to restore the original order.
valid_columns = sorted(valid_columns, key=df.columns.index)
df.select(valid_columns).show()
#+--------+--------+
#|Column 1|Column 3|
#+--------+--------+
#| Ronaldo|   Messi|
#| Ronaldo|  Virgil|
#| Ronaldo|   Messi|
#+--------+--------+
```
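
Putting the steps together, the whole procedure could be wrapped in a small helper function. This is a hypothetical convenience wrapper (the name `drop_invalid_columns` is not a PySpark API), shown here as a sketch:

```python
from pyspark.sql.functions import col, lit, sum as _sum, when

def drop_invalid_columns(df, valid_values):
    """Keep only the columns whose every cell is in valid_values.

    Nulls count as invalid, matching the behaviour above. This is a
    hypothetical helper, not a built-in PySpark function.
    """
    counts = df.select(
        *[_sum(when(col(c).isin(valid_values), lit(0)).otherwise(lit(1))).alias(c)
          for c in df.columns]
    ).collect()[0].asDict()
    # Iterating over df.columns preserves the original column order,
    # so no separate sorting step is needed.
    keep = [c for c in df.columns if counts[c] == 0]
    return df.select(keep)

result = drop_invalid_columns(df, ['Messi', 'Ronaldo', 'Virgil'])
```

Because all the per-column counts are gathered in a single select, the data is scanned only once, which keeps the scanning/shuffling concern from the question in check.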
