从PySpark DataFrame中的Python列表列表中删除一个元素 [英] Remove an element from a Python list of lists in PySpark DataFrame

查看：377 发布时间：2020/9/4 4:01:45 python apache-spark pyspark apache-spark-sql pyspark-sql

本文介绍了从PySpark DataFrame中的Python列表列表中删除一个元素的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在尝试从列表的Python列表中删除一个元素:

I am trying to remove an element from a Python list of lists:

+---------------+
|        sources|
+---------------+
|           [62]|
|        [7, 32]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|    [7, 32, 62]|

我希望能够从上面列表中的每个列表中删除元素rm.我写了一个函数，可以对列表列表执行此操作:

I want to be able to remove an element, rm, from each of the lists in the list above. I wrote a function that can do that for a list of lists:

def asdf(df, rm):
    temp = df
    for n in range(len(df)):
        temp[n] = [x for x in df[n] if x != rm]
    return(temp)

确实删除了rm = 1:

a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In:  asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]

但是我无法使它适用于DataFrame:

But I can't get it to work for a DataFrame:

asdfUDF = udf(asdf, ArrayType(IntegerType()))

In: df.withColumn("src_ex", asdfUDF("sources", 32))

Out: Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist

期望的行为:

In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out: 

+---------------+
|         src_ex|
+---------------+
|           [62]|
|            [7]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|        [7, 62]|

(除了将上面的新列附加到PySpark DataFrame之后，df)

(except have the new column above appended to a PySpark DataFrame, df)

有什么建议或想法吗?

推荐答案

火花> = 2.4

您可以使用array_remove:

from pyspark.sql.functions import array_remove

df.withColumn("src_ex", array_remove("sources", 32)).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

或filter:

from pyspark.sql.functions import expr

df.withColumn("src_ex", expr("filter(sources, x -> not(x <=> 32))")).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

火花< 2.4

一些事情:

DataFrame不是列表的列表.实际上，它甚至不是普通的Python对象，没有len，也没有Iterable.
您看到的列看起来像普通的array类型.
您不能引用DataFrame(或UDF中的任何其他分布式数据结构).
直接传递给UDF调用的每个参数都必须是str(列名)或Column对象.要传递文字，请使用lit函数.

DataFrame is not a list of lists. In practice it is not even a plain Python object, it has no len and it is not Iterable.
Column you have looks like plain array type.
You cannot reference DataFrame (or any other distributed data structure inside UDF).
Every argument passed directly to UDF call has to be a str (column name) or Column object. To pass literal use lit function.

唯一剩下的就是列表理解:

The only thing that remains is just a list comprehension:

from pyspark.sql.functions import lit, udf

def drop_from_array_(arr, item):
    return [x for x in arr if x != item]

drop_from_array = udf(drop_from_array_, ArrayType(IntegerType()))

示例用法:

df = sc.parallelize([
    [62], [7, 32], [62], [18, 36, 62], [7, 31, 36, 62], [7, 32, 62]
]).map(lambda x: (x, )).toDF(["sources"])

df.withColumn("src_ex", drop_from_array("sources", lit(32)))

结果:

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

这篇关于从PySpark DataFrame中的Python列表列表中删除一个元素的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

从PySpark DataFrame中的Python列表列表中删除一个元素 [英] Remove an element from a Python list of lists in PySpark DataFrame

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

从PySpark DataFrame中的Python列表列表中删除一个元素 [英] Remove an element from a Python list of lists in PySpark DataFrame

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭