Remove an element from a Python list of lists in PySpark DataFrame


Question

I am trying to remove an element from a Python list of lists:

+---------------+
|        sources|
+---------------+
|           [62]|
|        [7, 32]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|    [7, 32, 62]|
+---------------+

I want to be able to remove an element, rm, from each of the lists in the list above. I wrote a function that can do that for a list of lists:

def asdf(df, rm):
    temp = df
    for n in range(len(df)):
        temp[n] = [x for x in df[n] if x != rm]
    return(temp)

which does remove rm = 1:

a = [[1,2,3],[1,2,3,4],[1,2,3,4,5]]
In:  asdf(a,1)
Out: [[2, 3], [2, 3, 4], [2, 3, 4, 5]]
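As an aside, a sketch (the name asdf_copy is mine, not from the question): `temp = df` only rebinds a second name to the same list object, so the per-index assignment inside asdf also mutates the caller's list. Building fresh lists avoids that side effect:

```python
def asdf_copy(df, rm):
    # Build a brand-new list of lists instead of assigning into the
    # input, so the caller's data is left untouched.
    return [[x for x in row if x != rm] for row in df]

a = [[1, 2, 3], [1, 2, 3, 4]]
b = asdf_copy(a, 1)
# b == [[2, 3], [2, 3, 4]], while a still holds the original values
```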

But I can't get it to work for a DataFrame:

asdfUDF = udf(asdf, ArrayType(IntegerType()))

In: df.withColumn("src_ex", asdfUDF("sources", 32))

Out: Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist

Desired behavior:

In: df.withColumn("src_ex", asdfUDF("sources", 32))
Out: 

+---------------+
|         src_ex|
+---------------+
|           [62]|
|            [7]|
|           [62]|
|   [18, 36, 62]|
|[7, 31, 36, 62]|
|        [7, 62]|
+---------------+

(except have the new column above appended to a PySpark DataFrame, df)

Any suggestions or ideas?

Answer

Spark >= 2.4

You can use array_remove:

from pyspark.sql.functions import array_remove

df.withColumn("src_ex", array_remove("sources", 32)).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

or filter:

from pyspark.sql.functions import expr

df.withColumn("src_ex", expr("filter(sources, x -> not(x <=> 32))")).show()

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+

Spark < 2.4

A few things:

  • A DataFrame is not a list of lists. In practice it is not even a plain local Python object: it has no len and it is not iterable.
  • The column you have looks like a plain array type.
  • You cannot reference a DataFrame (or any other distributed data structure) inside a UDF.
  • Every argument passed directly to a UDF call has to be a str (column name) or a Column object. To pass a literal, use the lit function.

The only thing that remains is just a list comprehension:

from pyspark.sql.functions import lit, udf
from pyspark.sql.types import ArrayType, IntegerType

def drop_from_array_(arr, item):
    return [x for x in arr if x != item]

drop_from_array = udf(drop_from_array_, ArrayType(IntegerType()))

Example usage:

df = sc.parallelize([
    [62], [7, 32], [62], [18, 36, 62], [7, 31, 36, 62], [7, 32, 62]
]).map(lambda x: (x, )).toDF(["sources"])

df.withColumn("src_ex", drop_from_array("sources", lit(32))).show()

Result:

+---------------+---------------+
|        sources|         src_ex|
+---------------+---------------+
|           [62]|           [62]|
|        [7, 32]|            [7]|
|           [62]|           [62]|
|   [18, 36, 62]|   [18, 36, 62]|
|[7, 31, 36, 62]|[7, 31, 36, 62]|
|    [7, 32, 62]|        [7, 62]|
+---------------+---------------+
