ARRAY_CONTAINS multiple values in pyspark

Views: 120 Published: 2021/5/14 19:06:43 Tags: python sql hive pyspark

Question

I am working with a pyspark.sql.dataframe.DataFrame. I would like to filter the rows of stack based on multiple values rather than a single one, {val}. I am working in a Python 2 Jupyter notebook. At present, I do the following:

stack = hiveContext.sql("""
    SELECT * 
    FROM db.table
    WHERE col_1 != ''
""")

stack.show()
+---+-------+-------+---------+
| id| col_1 | . . . | list    |
+---+-------+-------+---------+
| 1 |   524 | . . . |[1, 2]   |
| 2 |   765 | . . . |[2, 3]   |
.
.
.
| 9 |   765 | . . . |[4, 5, 8]|

values = [1, 2]  # example values to look for in the `list` column
for val in values:
    filtered_stack = stack.filter("array_contains(list, {val})".format(val=val))
    # (some query on filtered_stack)

How would I rewrite this in Python to filter rows based on more than one value, i.e. where {val} is an array of one or more elements?

My question is related to "ARRAY_CONTAINS multiple values in hive"; however, I am trying to achieve the above in a Python 2 Jupyter notebook.

Recommended answer

With a Python UDF:

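from pyspark.sql.functions import udf, size
from pyspark.sql.types import ArrayType, IntegerType

# Build a UDF that returns the elements common to two array columns
# (None if either array is null). `element_type` is the Spark type of
# the array elements, e.g. IntegerType().
intersect = lambda element_type: (udf(
    lambda x, y: (
        list(set(x) & set(y)) if x is not None and y is not None else None),
    ArrayType(element_type)))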

df = sc.parallelize([([1, 2, 3], [1, 2]), ([3, 4], [5, 6])]).toDF(["xs", "ys"])

integer_intersect = intersect(IntegerType())

df.select(
    integer_intersect("xs", "ys"),
    size(integer_intersect("xs", "ys"))).show()

+----------------+----------------------+
|<lambda>(xs, ys)|size(<lambda>(xs, ys))|
+----------------+----------------------+
|          [1, 2]|                     2|
|              []|                     0|
+----------------+----------------------+
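
The function wrapped by the UDF is ordinary Python, so its behaviour can be checked without a Spark session. The sketch below is only an illustration: intersect_lists is a hypothetical name for the inner lambda, and the result is sorted here so the output is deterministic (the UDF itself returns the elements in set order).

```python
def intersect_lists(x, y):
    """Return the elements common to x and y, or None if either input is None.

    Mirrors the inner lambda passed to udf() above; the name is hypothetical.
    """
    if x is None or y is None:
        return None
    # sorted() is an addition for stable output; the UDF returns set order.
    return sorted(set(x) & set(y))


# The same rows as the example DataFrame, plus a row with a missing array.
rows = [([1, 2, 3], [1, 2]), ([3, 4], [5, 6]), (None, [1])]
for xs, ys in rows:
    print(xs, ys, intersect_lists(xs, ys))
```

A row passes the size(...) > 0 style of filter exactly when this intersection is non-empty.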

With literals:

from pyspark.sql.functions import array, lit

df.select(integer_intersect("xs", array(lit(1), lit(5)))).show()

+-------------------------+
|<lambda>(xs, array(1, 5))|
+-------------------------+
|                      [1]|
|                       []|
+-------------------------+

Or:

df.where(size(integer_intersect("xs", array(lit(1), lit(5)))) > 0).show()

+---------+------+
|       xs|    ys|
+---------+------+
|[1, 2, 3]|[1, 2]|
+---------+------+
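
If a UDF is undesirable, the string-based filter from the question can probably also be extended to several values by combining one array_contains call per value. This is a sketch under assumptions, not part of the original answer: build_array_filter and its match parameter are hypothetical names, the values are assumed to be integers, and the resulting string is meant to be passed to stack.filter(...).

```python
def build_array_filter(column, values, match="any"):
    """Build a Spark SQL predicate testing an array column against several values.

    match="any" keeps rows containing at least one value (clauses joined with OR);
    match="all" keeps rows containing every value (clauses joined with AND).
    Hypothetical helper; values are assumed to be integers.
    """
    if not values:
        raise ValueError("values must contain at least one element")
    if match not in ("any", "all"):
        raise ValueError("match must be 'any' or 'all'")
    joiner = " OR " if match == "any" else " AND "
    # int() keeps arbitrary strings from being pasted into the SQL text.
    clauses = ["array_contains({0}, {1})".format(column, int(v)) for v in values]
    return "(" + joiner.join(clauses) + ")"


expr = build_array_filter("list", [1, 5])
print(expr)  # (array_contains(list, 1) OR array_contains(list, 5))
# With a live DataFrame: filtered_stack = stack.filter(expr)
```

Building a single expression replaces the per-value loop with one filter call, so the rows matching any (or all) of the values come back in one DataFrame.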
