PySpark DataFrames: filter where some value is in array column
Question
I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:

root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there should be case-insensitive (like it is for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Does anyone have an idea for a straightforward way to do this?
Answer
Update for 2019

Spark 2.4.0 introduced new functions such as array_contains and transform (see the official documentation), so this can now be done directly in SQL.
For your problem, it should be:

dataframe.filter('upper(name) = "JOHN" AND array_contains(transform(lastName, x -> upper(x)), "SMITH")')

Note the array is checked for "SMITH" (the last name), while the name column is compared against "JOHN"; transform upper-cases every array element so the membership test is case-insensitive.
This is better than the earlier solution that used an RDD as a bridge, because DataFrame operations are much faster than RDD ones.