PySpark Numeric Window Group By
Question
I'd like Spark to group by a step size rather than by single values. Is there anything in Spark similar to PySpark 2.x's window function that works on numeric (non-date) values?
Something like the following:
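from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is an existing SparkContext
df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo")
# Desired (hypothetical) call: window() does not accept step/start like this
res = df.groupBy(window("foo", step=2, start=10)).count()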
Recommended answer
You can reuse the timestamp window function: cast the numeric column to a timestamp and express the parameters in seconds. Tumbling window:
from pyspark.sql.functions import col, window

df.withColumn(
    "window",
    window(
        col("foo").cast("timestamp"),
        windowDuration="2 seconds"
    ).cast("struct<start:bigint,end:bigint>")
).show()
# +---+-------+
# |foo| window|
# +---+-------+
# | 10|[10,12]|
# | 11|[10,12]|
# | 12|[12,14]|
# | 13|[12,14]|
# +---+-------+
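To make the bucketing arithmetic behind the tumbling output above explicit, here is a minimal plain-Python sketch. It is not Spark code: `tumbling_window` is a hypothetical helper introduced only for illustration, and it assumes windows are aligned to 0 and half-open (start inclusive, end exclusive), which is how Spark's time windows behave.

```python
def tumbling_window(value, width):
    """Return the (start, end) bucket of size `width` that contains `value`.

    Hypothetical helper for illustration; buckets are aligned to 0 and
    half-open, i.e. start <= value < end.
    """
    start = value - (value % width)
    return (start, start + width)


values = [10, 11, 12, 13]
buckets = [tumbling_window(v, 2) for v in values]
for v, b in zip(values, buckets):
    print(v, b)
```

Each value lands in exactly one bucket, which is why the tumbling result has one row per input row.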
Sliding window:
df.withColumn(
    "window",
    window(
        col("foo").cast("timestamp"),
        windowDuration="2 seconds", slideDuration="1 seconds"
    ).cast("struct<start:bigint,end:bigint>")
).show()
# +---+-------+
# |foo| window|
# +---+-------+
# | 10| [9,11]|
# | 10|[10,12]|
# | 11|[10,12]|
# | 11|[11,13]|
# | 12|[11,13]|
# | 12|[12,14]|
# | 13|[12,14]|
# | 13|[13,15]|
# +---+-------+
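The sliding output above has two rows per value because each value falls into every window that overlaps it. A rough plain-Python sketch of that assignment follows; `sliding_windows` is a hypothetical helper, not part of PySpark, and it assumes window starts are multiples of the slide and that windows are half-open.

```python
def sliding_windows(value, width, slide):
    """Return every (start, end) window of size `width`, starting at
    multiples of `slide`, that contains `value` (start <= value < end).

    Hypothetical helper for illustration only.
    """
    windows = []
    start = value - (value % slide)  # latest window start not after value
    while start > value - width:     # window must still cover value
        windows.append((start, start + width))
        start -= slide
    return sorted(windows)


for v in [10, 11, 12, 13]:
    print(v, sliding_windows(v, width=2, slide=1))
```

With a width of 2 and a slide of 1, every value belongs to width / slide = 2 windows, matching the duplicated rows in the table.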
Using groupBy and start:
w = window(col("foo").cast("timestamp"), "2 seconds").cast("struct<start:bigint,end:bigint>")
start = w.start.alias("start")
df.groupBy(start).count().show()
# +-----+-----+
# |start|count|
# +-----+-----+
# |   10|    2|
# |   12|    2|
# +-----+-----+
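The question asked for both a step and a start offset. As a sketch of what that grouping computes, the plain-Python function below counts values per bucket; `count_by_step` and its parameters are hypothetical names chosen to mirror the question, not a Spark API. In Spark itself, the `window` function has a `startTime` argument for offsetting window boundaries, which may serve the same purpose.

```python
from collections import Counter


def count_by_step(values, step, start=0):
    """Count values per bucket of size `step`, with bucket edges at
    start, start + step, start + 2 * step, ...

    Hypothetical helper mirroring the `step`/`start` idea from the question.
    """
    counts = Counter()
    for v in values:
        bucket_start = start + ((v - start) // step) * step
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


aligned = count_by_step([10, 11, 12, 13], step=2, start=10)
shifted = count_by_step([10, 11, 12, 13], step=2, start=11)
print(aligned)
print(shifted)
```

With `start=10` the counts agree with the groupBy result above; shifting the start to 11 moves the bucket edges and changes which values are grouped together.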