Scala: How can I split up a dataframe by row number?


Question

I want to split up a dataframe of 2.7 million rows into small dataframes of 100000 rows each, so I end up with about 27 dataframes, which I also want to store as csv files.

I already took a look at partitionBy and groupBy, but I don't need to worry about any conditions, except that the rows have to be ordered by date. I am trying to write my own code to make this work, but if you know of some Scala (Spark) functions I could use, that would be great!

Thanks for any suggestions!

Answer

You could use zipWithIndex from the RDD API (unfortunately there is no equivalent in SparkSQL), which maps each row to an index ranging from 0 to rdd.count - 1.
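In isolation, zipWithIndex simply pairs each element with its position as a Long (a minimal sketch, assuming an active SparkContext `sc`):

```scala
// zipWithIndex preserves the RDD's order and assigns 0-based Long indices
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.zipWithIndex.collect()
// Array((a,0), (b,1), (c,2))
```

Note that it triggers a Spark job when the RDD has more than one partition, since the index offsets per partition must be computed first.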

So if you have a dataframe that I assume is sorted accordingly, you would need to go back and forth between the two APIs as follows:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// creating mock data
val df = spark.range(100).withColumn("test", 'id % 10)

// zipping the data
val partitionSize = 5 // I use 5 but you can use 100000 in your case
val zipped_rdd = df.rdd
    .zipWithIndex.map { case (row, id) =>
        Row.fromSeq(row.toSeq :+ id / partitionSize)
    }

//back to df
val newField = StructField("partition", LongType, false)
val zipped_df = spark
    .createDataFrame(zipped_rdd, df.schema.add(newField))

Let's have a look at the data: we have a new column called partition, and it corresponds to the way you want to split your data.

zipped_df.show(15) // 5 rows by partition
+---+----+---------+
| id|test|partition|
+---+----+---------+
|  0|   0|        0|
|  1|   1|        0|
|  2|   2|        0|
|  3|   3|        0|
|  4|   4|        0|
|  5|   5|        1|
|  6|   6|        1|
|  7|   7|        1|
|  8|   8|        1|
|  9|   9|        1|
| 10|   0|        2|
| 11|   1|        2|
| 12|   2|        2|
| 13|   3|        2|
| 14|   4|        2|
+---+----+---------+

// using partitionBy to write the data
zipped_df.write
    .partitionBy("partition")
    .csv(".../testPart.csv")
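As a side note, if you would rather stay within the DataFrame API, a similar partition index can be built with the row_number window function instead of dropping to the RDD level. The sketch below is an assumption on my part, not part of the original answer; it orders by an `id` column, which you would replace with your date column, and a window without a partitionBy clause pulls all rows into a single task for the ordering, which can be slow on 2.7 million rows:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// row_number is 1-based, so subtract 1 before the integer division
val w = Window.orderBy("id") // replace "id" with your date column
val byRowNumber = df
  .withColumn("partition", (row_number.over(w) - 1) / partitionSize)
```

The resulting `partition` column can then be passed to `partitionBy` when writing, exactly as above.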
