Split Spark DataFrame into two DataFrames (70% and 30%) based on id column by preserving order

Problem description

I have a Spark DataFrame like:

id  start_time   feature
1   01-01-2018   3.567
1   01-02-2018   4.454
1   01-03-2018   6.455
2   01-02-2018   343.4
2   01-08-2018   45.4
3   02-04-2018   43.56
3   02-07-2018   34.56
3   03-07-2018   23.6

I want to split this into two DataFrames based on the id column. So I should group by the id column, sort by start_time, take 70% of each group's rows into one DataFrame and the remaining 30% into another, preserving the order. The result should look like:

Dataframe1:
id  start_time   feature
1   01-01-2018   3.567
1   01-02-2018   4.454
2   01-02-2018   343.4
3   02-04-2018   43.56
3   02-07-2018   34.56

Dataframe2:
1   01-03-2018   6.455
2   01-08-2018   45.4
3   03-07-2018   23.6

I am using Spark 2.0 with Python. What is the best way to implement this?

Recommended answer

The way I did it was to create two windows:

from pyspark.sql import Window
from pyspark.sql import functions as F

# w1 ranks rows within each id by start_time; w2 counts rows per id
w1 = Window.partitionBy(df.id).orderBy(df.start_time)
w2 = Window.partitionBy(df.id)

df = df.withColumn("row_number", F.row_number().over(w1)) \
       .withColumn("count", F.count("id").over(w2)) \
       .withColumn("percent", F.col("row_number") / F.col("count"))

# The first 70% of each group (ordered by start_time) goes to train,
# the remaining rows go to test
train = df.filter(df.percent <= 0.70)
test = df.filter(df.percent > 0.70)
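To make the split logic concrete without a Spark cluster, here is a minimal pure-Python sketch of the same idea, using the sample data from the question. The function name `split_by_id` is a hypothetical helper, not part of the answer above; it mirrors the row_number/count/percent computation per group.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (id, start_time, feature)
rows = [
    (1, "01-01-2018", 3.567),
    (1, "01-02-2018", 4.454),
    (1, "01-03-2018", 6.455),
    (2, "01-02-2018", 343.4),
    (2, "01-08-2018", 45.4),
    (3, "02-04-2018", 43.56),
    (3, "02-07-2018", 34.56),
    (3, "03-07-2018", 23.6),
]

def split_by_id(rows, frac=0.70):
    """Sort rows within each id by start_time; the first `frac` of each
    group goes to train, the rest to test (order preserved)."""
    train, test = [], []
    grouped = groupby(sorted(rows, key=itemgetter(0, 1)), key=itemgetter(0))
    for _id, grp in grouped:
        grp = list(grp)
        n = len(grp)
        for i, row in enumerate(grp, start=1):
            # i / n plays the role of the "percent" column in the Spark answer
            (train if i / n <= frac else test).append(row)
    return train, test

train, test = split_by_id(rows)
```

On this data, `train` ends up with the first two rows of each 3-row group and the first row of the 2-row group (5 rows total), and `test` gets the last row of each group (3 rows), matching Dataframe1 and Dataframe2 above.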
