过滤数组列内容 [英] Filter array column content

查看：23 发布时间：2021/11/14 21:46:52 apache-spark pyspark pyspark-sql

本文介绍了过滤数组列内容的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用 pyspark 2.3.1 并且想使用表达式而不是使用 udf 过滤数组元素:

<预><代码>>>>df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])>>>df.show()+----+----+---------------+|col1|col2|col3|+----+----+---------------+|1|A|[1, 2, 3, 4]||2|B|[1, 2, 3, 4, 5]|+----+----+---------------+

下面显示的表达式是错误的，我想知道如何告诉 spark 从 col3 中的数组中删除任何小于 3 的值.我想要类似的东西:

<预><代码>>>>过滤 = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)")).show()>>>过滤.show()+----+----+---------+|col1|col2|新上衣|+----+----+---------+|1|A|[3, 4]||2|B|[3, 4, 5]|+----+----+---------+

我已经有一个 udf 解决方案，但速度很慢(> 10 亿数据行):

largerThan = F.udf(lambda row,max: [x for x in row if x >= max], ArrayType(IntegerType()))df = df.withColumn('newcol', size(largerThan(df.queries, lit(3))))

欢迎任何帮助.预先非常感谢您.

解决方案

Spark 2.4

PySpark 中的 udf 没有*合理的替代品.

Spark >= 2.4

您的代码:

expr("filter(col3, x -> x >= 3)")

可以按原样使用.

参考

查询具有复杂类型的 Spark SQL DataFrame

<小时>
I am using pyspark 2.3.1 and would like to filter array elements with an expression and not an using udf:
>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"]) >>> df.show() +----+----+---------------+ |col1|col2| col3| +----+----+---------------+ | 1| A| [1, 2, 3, 4]| | 2| B|[1, 2, 3, 4, 5]| +----+----+---------------+
The expreesion shown below is wrong, I wonder how to tell spark to remove out any values from the array in col3 which are smaller than 3. I want something like:
>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)")).show() >>> filtered.show() +----+----+---------+ |col1|col2| newcol| +----+----+---------+ | 1| A| [3, 4]| | 2| B|[3, 4, 5]| +----+----+---------+
I have already an udf solution, but it is very slow (> 1 billions data rows):
largerThan = F.udf(lambda row,max: [x for x in row if x >= max], ArrayType(IntegerType())) df = df.withColumn('newcol', size(largerThan(df.queries, lit(3))))
Any help is welcome. Thank you very much in advance.
解决方案
Spark < 2.4

There is no *reasonable replacement for udf in PySpark.

Spark >= 2.4

Your code:
expr("filter(col3, x -> x >= 3)")
can be used as is.

Reference

Querying Spark SQL DataFrame with complex types

* Given the cost of exploding or converting to and from RDD udf is almost exclusively preferable.

这篇关于过滤数组列内容的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

过滤数组列内容 [英] Filter array column content

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

过滤数组列内容 [英] Filter array column content

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭