Find number of rows in a given week in PySpark
Question
I have a PySpark dataframe, a small portion of which is given below:
+------+-----+-------------------+-----+
|  name| type|          timestamp|score|
+------+-----+-------------------+-----+
| name1|type1|2012-01-10 00:00:00|   11|
| name1|type1|2012-01-10 00:00:10|   14|
| name1|type1|2012-01-10 00:00:20|    2|
| name1|type1|2012-01-10 00:00:30|    3|
| name1|type1|2012-01-10 00:00:40|   55|
| name1|type1|2012-01-10 00:00:50|   10|
| name5|type1|2012-01-10 00:01:00|    5|
| name2|type2|2012-01-10 00:01:10|    8|
| name5|type1|2012-01-10 00:01:20|    1|
|name10|type1|2012-01-10 00:01:30|   12|
|name11|type3|2012-01-10 00:01:40|  512|
+------+-----+-------------------+-----+
For a chosen time window (say, windows of 1 week), I want to find out how many values of score (say, num_values_week) there are for every name. That is, how many values of score there are for name1 between 2012-01-10 and 2012-01-16, then between 2012-01-16 and 2012-01-23, and so forth (and the same for all other names, such as name2).
I want to put this information in a new PySpark DataFrame with the columns name, type, and num_values_week. How can I do this?
The PySpark dataframe given above can be created using the following code snippet:
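from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as F

# Reuse the active session (e.g. the one in the pyspark shell) or create a new one
spark = SparkSession.builder.getOrCreate()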
df_Stats = Row("name", "type", "timestamp", "score")
df_stat1 = df_Stats('name1', 'type1', "2012-01-10 00:00:00", 11)
df_stat2 = df_Stats('name2', 'type2', "2012-01-10 00:00:00", 14)
df_stat3 = df_Stats('name3', 'type3', "2012-01-10 00:00:00", 2)
df_stat4 = df_Stats('name4', 'type1', "2012-01-17 00:00:00", 3)
df_stat5 = df_Stats('name5', 'type3', "2012-01-10 00:00:00", 55)
df_stat6 = df_Stats('name2', 'type2', "2012-01-17 00:00:00", 10)
df_stat7 = df_Stats('name7', 'type3', "2012-01-24 00:00:00", 5)
df_stat8 = df_Stats('name8', 'type2', "2012-01-17 00:00:00", 8)
df_stat9 = df_Stats('name1', 'type1', "2012-01-24 00:00:00", 1)
df_stat10 = df_Stats('name10', 'type2', "2012-01-17 00:00:00", 12)
df_stat11 = df_Stats('name11', 'type3', "2012-01-24 00:00:00", 512)
df_stat_lst = [df_stat1, df_stat2, df_stat3, df_stat4, df_stat5,
               df_stat6, df_stat7, df_stat8, df_stat9, df_stat10, df_stat11]
df = spark.createDataFrame(df_stat_lst)
Recommended answer
Something like the following:
from pyspark.sql.functions import weekofyear, count

# Derive the ISO week number from the timestamp first
df = df.withColumn("week_nr", weekofyear(df.timestamp))
# For every week and name, count how many score values there are
result = df.groupBy(["week_nr", "name"]).agg(count("score"))