How to use window functions in PySpark using DataFrames?


Question


Trying to figure out how to use window functions in PySpark. Here's an example of what I'd like to be able to do: simply count the number of times a user has an "event" (in this case, "dt" is a simulated timestamp).

from pyspark.sql.window import Window
from pyspark.sql.functions import count

df = sqlContext.createDataFrame([
    {"id": 123, "dt": 0}, {"id": 123, "dt": 1},
    {"id": 234, "dt": 0},
    {"id": 456, "dt": 0}, {"id": 456, "dt": 1}, {"id": 456, "dt": 2},
])
df.select(["id", "dt"], count("dt").over(Window.partitionBy("id").orderBy("dt")).alias("count")).show()


This produces an error. What is the correct way to use window functions? I read that 1.4.1 (the version we need to use since it's what is standard on AWS) should be able to do them with the DataFrame API.


FWIW, the documentation is pretty sparse on this subject. And I had trouble getting any examples actually running.

Answer


It throws an exception because you are passing a list of columns. The signature of DataFrame.select looks as follows:

df.select(self, *cols)


and an expression using a window function is a column like any other, so what you need here is something like this:

w = Window.partitionBy("id").orderBy("dt") # Just for clarity
df.select("id","dt", count("dt").over(w).alias("count")).show()

## +---+---+-----+
## | id| dt|count|
## +---+---+-----+
## |234|  0|    1|
## |456|  0|    1|
## |456|  1|    2|
## |456|  2|    3|
## |123|  0|    1|
## |123|  1|    2|
## +---+---+-----+
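To see why the counts come out this way, the semantics of `count("dt").over(w)` with this window can be emulated in plain Python (a sketch for illustration only, no Spark required; the `running_count` helper and the inline `rows` data are hypothetical, mirroring the DataFrame from the question):

```python
from collections import defaultdict

# Same rows as the question's DataFrame.
rows = [
    {"id": 123, "dt": 0}, {"id": 123, "dt": 1},
    {"id": 234, "dt": 0},
    {"id": 456, "dt": 0}, {"id": 456, "dt": 1}, {"id": 456, "dt": 2},
]

def running_count(rows):
    """Emulate count("dt").over(Window.partitionBy("id").orderBy("dt")):
    within each id partition, ordered by dt, count all rows up to and
    including the current one."""
    seen = defaultdict(int)
    out = []
    for row in sorted(rows, key=lambda r: (r["id"], r["dt"])):
        seen[row["id"]] += 1  # one more row in this partition's frame
        out.append({**row, "count": seen[row["id"]]})
    return out

for r in running_count(rows):
    print(r)
```

The counts per `id` match the table above (Spark may emit the partitions in a different order).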


Generally speaking, Spark SQL window functions behave exactly the same way as in any modern RDBMS.
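So the same running count can also be written as a raw SQL window expression (a sketch, not from the original answer; the table name `df_table` is an assumption, and note that in Spark 1.4 window functions require a HiveContext rather than a plain SQLContext):

```sql
-- Equivalent to count("dt").over(Window.partitionBy("id").orderBy("dt")).
-- The table name df_table is hypothetical (register the DataFrame first).
SELECT id,
       dt,
       COUNT(dt) OVER (PARTITION BY id ORDER BY dt) AS count
FROM df_table
```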

