Remove an element from a Python list of lists in PySpark DataFrame


Question

I am trying to remove an element from a Python list of lists:

+---------------+
|        sources|
+---------------+
|           [62]|
|        [7, 32]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|    [7, 32, 62]|
+---------------+

I want to be able to remove an element, rm, from each of the lists in the list above. I wrote a function that can do that for a list of lists:

def asdf(df, rm):
    temp = df
    for n in range(len(df)):
        temp[n] = [x for x in df[n] if x != rm]
    return(temp)

which does remove rm = 1:

a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In:  asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]
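As an aside, a sketch (the name asdf_copy is mine, not from the question): `temp = df` only rebinds a second name to the same list object, so the per-index assignment inside asdf also mutates the caller's list. Building fresh lists avoids that side effect:

```python
def asdf_copy(df, rm):
    # Build a brand-new list of lists instead of assigning into the
    # input, so the caller's data is left untouched.
    return [[x for x in row if x != rm] for row in df]

a = [[1, 2, 3], [1, 2, 3, 4]]
b = asdf_copy(a, 1)
# b == [[2, 3], [2, 3, 4]], while a still holds the original values
```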

But I can't get it to work for a DataFrame:

asdfUDF = udf(asdf, ArrayType(IntegerType()))

In: df.withColumn("src_ex", asdfUDF("sources", 32))

Out: Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist

Desired behavior:

In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out: 

+---------------+
|         src_ex|
+---------------+
|           [62]|
|            [7]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|        [7, 62]|
+---------------+

(except have the new column above appended to a PySpark DataFrame, df)

Any suggestions or ideas?

Answer

Spark >= 2.4

You can use array_remove:

from pyspark.sql.functions import array_remove

df.withColumn("src_ex", array_remove("sources", 32)).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

or filter:

from pyspark.sql.functions import expr

df.withColumn("src_ex", expr("filter(sources, x -> not(x <=> 32))")).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

Spark < 2.4

A few things:

  • A DataFrame is not a list of lists. In practice it is not even a plain local Python object: it has no len and it is not iterable.
  • The column you have looks like a plain array type.
  • You cannot reference a DataFrame (or any other distributed data structure) inside a UDF.
  • Every argument passed directly to a UDF call has to be a str (column name) or a Column object. To pass a literal, use the lit function.

The only thing that remains is just a list comprehension:

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import ArrayType, IntegerType

def drop_from_array_(arr, item):
    return [x for x in arr if x != item]

drop_from_array = udf(drop_from_array_, ArrayType(IntegerType()))

Example usage:

df = sc.parallelize([
    [62], [7, 32], [62], [18, 36, 62], [7, 31, 36, 62], [7, 32, 62]
]).map(lambda x: (x, )).toDF(["sources"])

df.withColumn("src_ex", drop_from_array("sources", lit(32))).show()

Result:

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+
