DataFrame-ified zipWithIndex

Views: 761 Published: 2016/5/22 15:14:09 Tags: apache-spark apache-spark-sql

Question

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex. On the other hand, the following works more or less the way I want it to:

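import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val origDF = sqlContext.load(...)

// Pair each row with its index, then prepend the index as a new "seq" column.
val seqDF = sqlContext.createDataFrame(
    origDF.rdd.zipWithIndex.map(ln => Row.fromSeq(Seq(ln._2) ++ ln._1.toSeq)),
    StructType(Array(StructField("seq", LongType, false)) ++ origDF.schema.fields)
)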

In my actual application, origDF won't be loaded directly out of a file -- it is going to be created by joining 2-3 other DataFrames together and will contain upwards of 100 million rows.

Is there a better way to do this? What can I do to optimize it?

Recommended answer

The following was posted on behalf of David Griffin (edited out of the question).

The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in the front or the back:

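import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    // Pair each row with its 0-based index, shift it by offset, and add it in front or at the back.
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq())
          ++ ln._1.toSeq ++
        (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    // Extend the schema with a non-nullable Long column in the matching position.
    StructType(
      (if (inFront) Array(StructField(colName, LongType, false)) else Array[StructField]())
        ++ df.schema.fields ++
      (if (inFront) Array[StructField]() else Array(StructField(colName, LongType, false)))
    )
  )
}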
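
To see what the row-assembly step does without starting Spark, here is a minimal sketch on plain Scala collections. It is only an illustration under assumptions: zipRowsWithIndex and ZipWithIndexSketch are hypothetical names, each "row" is modelled as a Seq[Any] rather than a Spark Row, and no schema is built. It mirrors the offset and inFront handling of dfZipWithIndex above.

```scala
// Plain-collections sketch, no Spark required. Each "row" is a Seq[Any].
// zipRowsWithIndex is a hypothetical helper for illustration, not a Spark API.
object ZipWithIndexSketch {
  def zipRowsWithIndex(
    rows: Seq[Seq[Any]],
    offset: Long = 1L,
    inFront: Boolean = true
  ): Seq[Seq[Any]] =
    rows.zipWithIndex.map { case (row, idx) =>
      // Scala collections yield an Int index; RDD.zipWithIndex yields a Long,
      // so the index is widened here to match the LongType column.
      val id: Any = idx.toLong + offset
      // Prepend or append the shifted index, as inFront selects.
      if (inFront) id +: row else row :+ id
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq[Any]("a", 10), Seq[Any]("b", 20))
    // Index first, counting from 1 (the defaults).
    println(zipRowsWithIndex(rows))
    // Index last, counting from 0.
    println(zipRowsWithIndex(rows, offset = 0L, inFront = false))
  }
}
```

With the real method, the equivalent call would be along the lines of dfZipWithIndex(origDF, offset = 0, colName = "seq", inFront = false), which uses only the parameters defined in the answer.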
