Filtering a DataFrame using the length of a column


Question

I want to filter a DataFrame using a condition related to the length of a column. This question might be very easy, but I didn't find any related question on SO.

More specifically, I have a DataFrame with only one column, of ArrayType(StringType()), and I want to filter the DataFrame using the length as the filter. A snippet is shown below.

df = sqlContext.read.parquet("letters.parquet")
df.show()

# The output will be 
# +------------+
# |      tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# |   [S, U, S]|
# +------------+

# But I want only the entries with length 3 or less
fdf = df.filter(len(df.tokens) <= 3)
fdf.show()  # But this raises TypeError: object of type 'Column' has no len(), so the previous statement is obviously incorrect.

I read the Column documentation, but didn't find any property useful for this. I'd appreciate any help!

Answer

In Spark >= 1.5 you can use the size function:

from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"],  ),
    (["L", "V", "I", "S"],  ),
    (["I", "A", "N", "A"],  ),
    (["I", "L", "S", "A"],  ),
    (["E", "N", "N", "Y"],  ),
    (["E", "I", "M", "A"],  ),
    (["O", "A", "N", "A"],  ),
    (["S", "U", "S"],  )], 
    ("tokens", ))

df.where(size(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
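
As a side note (my own addition, not part of the original answer): filter is an alias for where, and in Spark >= 1.5 the same condition can also be written as a SQL expression string:

df.filter("size(tokens) <= 3").show()  # same result as above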

In Spark < 1.5 a UDF should do the trick:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()

## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
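
One caveat worth noting (my own addition, not from the original answer): a plain len(xs) will fail if the column contains nulls, so a null-safe variant of the UDF is safer:

# Null-safe length UDF: returns None for null arrays, and those rows
# are simply filtered out by the <= 3 comparison.
size_ = udf(lambda xs: len(xs) if xs is not None else None, IntegerType())
df.where(size_(col("tokens")) <= 3).show()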

If you use HiveContext then the size UDF with raw SQL should work with any version:

df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()

## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+

For string columns you can use either a UDF defined above or the length function:

from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))
df.where(length(col("k")) <= 3).show()

## +---+
## |  k|
## +---+
## |bar|
## +---+
