PySpark DataFrames: filter where some value is in array column
Question
I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:

root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there should be case-insensitive (like it is for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Does anyone have an idea for a straightforward way to do this?
Answer
Update for 2019

Spark 2.4.0 introduced new functions such as array_contains and transform (see the official documentation), so this can now be done directly in SQL.
For your problem, it should be:

dataframe.filter('upper(name) = "JOHN" AND array_contains(transform(lastName, x -> upper(x)), "SMITH")')

Note the array is checked for "SMITH" (the last name), while the name column is compared against "JOHN"; transform upper-cases every array element so the membership test is case-insensitive.
This is better than the earlier solution that used an RDD as a bridge, because DataFrame operations are much faster than RDD ones.