Split Spark DataFrame into two DataFrames (70% and 30%) based on id column, preserving order
Problem Description
I have a Spark DataFrame which looks like:
id  start_time  feature
1   01-01-2018  3.567
1   01-02-2018  4.454
1   01-03-2018  6.455
2   01-02-2018  343.4
2   01-08-2018  45.4
3   02-04-2018  43.56
3   02-07-2018  34.56
3   03-07-2018  23.6
I want to split this into two DataFrames based on the id column. For each id, the rows should be sorted by start_time; then 70% of them go into one DataFrame and the remaining 30% into another, preserving the order. The result should look like:
Dataframe1:

id  start_time  feature
1   01-01-2018  3.567
1   01-02-2018  4.454
2   01-02-2018  343.4
3   02-04-2018  43.56
3   02-07-2018  34.56

Dataframe2:

id  start_time  feature
1   01-03-2018  6.455
2   01-08-2018  45.4
3   03-07-2018  23.6
I am using Spark 2.0 with Python. What is the best way to implement this?
Answer
The way I did it was to create two windows: one ordered window to number the rows within each id, and one unordered window to count the rows per id. Dividing the row number by the count gives each row's position as a fraction of its group, which can then be filtered at the 0.70 boundary:
from pyspark.sql import Window
from pyspark.sql import functions as F

w1 = Window.partitionBy(df.id).orderBy(df.start_time)
w2 = Window.partitionBy(df.id)

df = (df.withColumn("row_number", F.row_number().over(w1))
        .withColumn("count", F.count("id").over(w2))
        .withColumn("percent", F.col("row_number") / F.col("count")))

train = df.filter(df.percent <= 0.70)
test = df.filter(df.percent > 0.70)
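The split logic itself (per-id sort, then a fractional cutoff on the row's position within its group) can be checked without a Spark cluster. Below is a minimal pure-Python sketch of that same logic over the sample rows; the helper name split_by_id is illustrative, not part of any library, and it assumes the date strings sort correctly in lexicographic order, as they do in this sample:

```python
from itertools import groupby

rows = [
    (1, "01-01-2018", 3.567),
    (1, "01-02-2018", 4.454),
    (1, "01-03-2018", 6.455),
    (2, "01-02-2018", 343.4),
    (2, "01-08-2018", 45.4),
    (3, "02-04-2018", 43.56),
    (3, "02-07-2018", 34.56),
    (3, "03-07-2018", 23.6),
]

def split_by_id(rows, frac=0.70):
    """Per id: sort by start_time, send the first `frac` of rows to train,
    the rest to test, preserving order within each group."""
    train, test = [], []
    # groupby needs its input sorted by the grouping key (the id).
    for _, group in groupby(sorted(rows), key=lambda r: r[0]):
        group = sorted(group, key=lambda r: r[1])  # order by start_time
        n = len(group)
        for i, row in enumerate(group, start=1):
            # i / n mirrors the row_number / count "percent" column above.
            (train if i / n <= frac else test).append(row)
    return train, test

train, test = split_by_id(rows)
```

For id 1 (three rows) the fractions are 1/3, 2/3 and 3/3, so the first two rows fall at or below 0.70 and land in train, matching the expected Dataframe1/Dataframe2 output above.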