Append a column to a Data Frame in Apache Spark 1.3
Question
Is it possible to add a column to a Data Frame, and what would be the most efficient and neat way to do it?
More specifically, the column may serve as row IDs for the existing Data Frame.
In a simplified case, reading from a file and not tokenizing it, I can think of something like the following (in Scala), but it fails with errors (at line 3) and doesn't look like the best possible route anyway:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
Accepted answer
It's been a while since I posted this question, and it seems that some other people would like an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence 1 to numRows) to any given data frame, so that the order/presence of rows can be tracked (e.g. when you sample). This can be achieved by something along these lines:
sc.textFile(file).
  zipWithIndex().
  map { case (d, i) => i.toString + delimiter + d }.
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))
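The indexing logic above can be illustrated with plain Scala collections, since `zipWithIndex` exists on local sequences too. This is a minimal local analogue of the RDD pipeline (the sample lines and the comma delimiter are assumptions for illustration; in Spark, each element would be a line of the input file):

```scala
// Local-collection sketch of the zipWithIndex pipeline above.
// Assumes a comma delimiter and two sample "file lines".
object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    val delimiter = ","
    val lines = Seq("alice,30", "bob,25")

    val rows: Seq[Seq[String]] = lines
      .zipWithIndex                                      // pair each line with its position
      .map { case (d, i) => i.toString + delimiter + d } // prepend the index as a new field
      .map(_.split(delimiter).toSeq)                     // re-split into columns

    rows.foreach(println)
    // the first row now carries "0" as its ID, the second "1", and so on
  }
}
```

The final `map(s => Row.fromSeq(s.toSeq))` step of the RDD version has no local equivalent here; in Spark it wraps each field sequence into a `Row` so a DataFrame can be created from the result.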
On the general case of appending an arbitrary column to an arbitrary data frame:
The "closest" things to this functionality in the Spark API are withColumn and withColumnRenamed. According to the Scala docs, the former "returns a new DataFrame by adding a column". In my opinion, this is a slightly confusing and incomplete definition. Both of these functions can operate on this data frame only; i.e. given two data frames df1 and df2 with a column col:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column in an existing data frame into the shape you need, you cannot use withColumn or withColumnRenamed to append arbitrary columns (standalone or from other data frames).
As was commented above, the workaround may be to use a join: attaching unique keys like the above (via zipWithIndex) to both data frames or columns might work, although it would be pretty messy. And the efficiency is ...
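The join-based workaround can be sketched with local collections as well. This is a simplified stand-in for the Spark version (the two sample sequences are assumptions; in Spark, both sides would be RDDs/DataFrames keyed by an "ID" column produced with zipWithIndex, joined on that column):

```scala
// Local analogue of joining two "data frames" on zipWithIndex keys.
object JoinByIndexSketch {
  def main(args: Array[String]): Unit = {
    val left  = Seq("alice", "bob") // rows of the existing frame
    val right = Seq(30, 25)         // the column to append

    // Key both sides by position, then join on the shared key.
    val leftKeyed  = left.zipWithIndex.map(_.swap).toMap
    val rightKeyed = right.zipWithIndex.map(_.swap).toMap

    val joined = leftKeyed.keySet.intersect(rightKeyed.keySet)
      .toSeq.sorted
      .map(i => (leftKeyed(i), rightKeyed(i)))

    joined.foreach(println)
    // each row of the left side is now paired with the appended column value
  }
}
```

In the distributed setting, this keying-plus-join is exactly what makes the approach expensive: it forces a shuffle that a simple column append would not otherwise need.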
It's clear that appending a column to a data frame is not a simple operation in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance caveats.