Filter by whether column value equals a list in Spark

Problem description

I'm trying to filter a Spark DataFrame based on whether the values in a column equal a given list. I would like to do something like this:

filtered_df = df.where(df.a == ['list','of' , 'stuff'])

Here filtered_df should contain only the rows where the value of filtered_df.a is ['list', 'of', 'stuff'], and the type of a is array (nullable = true).

Recommended answer

A slightly hacky way to do this, which doesn't require a Python batch job, is something like this:

from pyspark.sql.functions import col, lit, size
from functools import reduce
from operator import and_

def array_equal(c, an_array):
    """Build a Column that is true when array column c equals an_array."""
    same_size = size(c) == len(an_array)  # Check if the same size
    # Check if all items are equal; lit(True) seeds the fold so that an
    # empty an_array does not make reduce() fail
    same_items = reduce(
        and_,
        (c.getItem(i) == an_array[i] for i in range(len(an_array))),
        lit(True)
    )
    return same_size & same_items

Quick test:

# Assumes an existing SparkContext named sc, as in the pyspark shell
df = sc.parallelize([
    (1, ['list','of' , 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list','of' , 'stuff', 'and', 'foo']),
    (5, ['a', 'list','of' , 'stuff']),
]).toDF(['id', 'a'])

df.where(array_equal(col('a'), ['list','of' , 'stuff'])).show()
## +---+-----------------+
## | id|                a|
## +---+-----------------+
## |  1|[list, of, stuff]|
## +---+-----------------+
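
To make the logic of the filter easier to follow, here is a rough plain-Python analogue of what the column expression checks for each row. It needs no Spark installation. The names row_array_equal and get_item are hypothetical and exist only for this illustration, and treating an out-of-range index as None is an assumption about how a missing array element should behave, not a statement about every Spark configuration.

```python
from functools import reduce
from operator import and_

def row_array_equal(row_value, an_array):
    """Plain-Python sketch of the per-row check (hypothetical helper)."""
    if row_value is None:
        # A null array can never equal the target list
        return False
    same_size = len(row_value) == len(an_array)

    def get_item(i):
        # Assumption: a missing index yields None instead of raising
        return row_value[i] if i < len(row_value) else None

    # Fold the element-wise comparisons together with "and",
    # starting from True so an empty target list is handled
    same_items = reduce(
        and_,
        (get_item(i) == an_array[i] for i in range(len(an_array))),
        True,
    )
    return same_size and same_items

# Same rows as in the quick test above
rows = [
    (1, ['list', 'of', 'stuff']),
    (2, ['foo', 'bar']),
    (3, ['foobar']),
    (4, ['list', 'of', 'stuff', 'and', 'foo']),
    (5, ['a', 'list', 'of', 'stuff']),
]

target = ['list', 'of', 'stuff']
matching_ids = [row_id for row_id, a in rows if row_array_equal(a, target)]
print(matching_ids)  # [1]
```

Row 4 shows why the size check matters: its first three items match the target, so the element-wise comparison alone would accept it, and only the length comparison rules it out.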
