过滤数组列的内容 [英] Filter array column content

查看:45
本文介绍了过滤数组列的内容的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用pyspark 2.3.1,并希望使用表达式而不是使用udf来过滤数组元素:

I am using pyspark 2.3.1 and would like to filter array elements with an expression and not an using udf:

>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2|           col3|
+----+----+---------------+
|   1|   A|   [1, 2, 3, 4]|
|   2|   B|[1, 2, 3, 4, 5]|
+----+----+---------------+

下面显示的表达式是错误的,我想知道如何告诉spark从col3的数组中删除小于3的任何值.我想要类似的东西:

The expreesion shown below is wrong, I wonder how to tell spark to remove out any values from the array in col3 which are smaller than 3. I want something like:

>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)")).show()
>>> filtered.show()
+----+----+---------+
|col1|col2|   newcol|
+----+----+---------+
|   1|   A|   [3, 4]|
|   2|   B|[3, 4, 5]|
+----+----+---------+

我已经有一个udf解决方案,但是它非常慢(> 10亿个数据行):

I have already an udf solution, but it is very slow (> 1 billions data rows):

largerThan = F.udf(lambda row,max: [x for x in row if x >= max], ArrayType(IntegerType()))
df = df.withColumn('newcol', size(largerThan(df.queries, lit(3))))

欢迎任何帮助.提前非常感谢您.

Any help is welcome. Thank you very much in advance.

推荐答案

火花< 2.4

在PySpark中没有* c0>的合理替代品.

There is no *reasonable replacement for udf in PySpark.

火花> = 2.4

您的代码:

expr("filter(col3, x -> x >= 3)")

可以原样使用.

参考

查询具有复杂类型的Spark SQL DataFrame

*考虑到RDD udf的爆炸或转换成本几乎是最好的选择.

* Given the cost of exploding or converting to and from RDD udf is almost exclusively preferable.

这篇关于过滤数组列的内容的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆