How to add a new column to a Spark RDD?

Question

I have an RDD with MANY columns (e.g., hundreds). How do I add one more column at the end of this RDD?

For example, if my RDD looks like the one below:

    123, 523, 534, ..., 893
    536, 98, 1623, ..., 98472
    537, 89, 83640, ..., 9265
    7297, 98364, 9, ..., 735
    ......
    29, 94, 956, ..., 758

How can I add a column to it whose value is the sum of the second and third columns?

Thank you very much.

Answer

You do not have to use Tuple* objects at all for adding a new column to an RDD.

It can be done by mapping each row, taking its original contents plus the elements you want to append, for example:

    import org.apache.spark.sql.Row

    val rdd = ... // an RDD[Row] with many columns
    val withAppendedColumnsRdd = rdd.map { row =>
      // keep the original columns as-is
      val originalColumns = row.toSeq.toList
      // read the second and third columns (cast to Int)
      val secondColValue = originalColumns(1).asInstanceOf[Int]
      val thirdColValue = originalColumns(2).asInstanceOf[Int]
      val newColumnValue = secondColValue + thirdColValue
      // append the computed value as the last column
      Row.fromSeq(originalColumns :+ newColumnValue)
      // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or add several new columns
    }
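
For instance, here is a minimal end-to-end sketch of the same idea, assuming a SparkContext named sc and that all columns are Ints; the sample rows and variable names are illustrative, not from the original answer:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // build a small RDD[Row] resembling the data in the question
    val rdd: RDD[Row] = sc.parallelize(Seq(
      Row(123, 523, 534, 893),
      Row(536, 98, 1623, 98472)
    ))

    // append a column holding the sum of the second and third columns
    val withSum = rdd.map { row =>
      val cols = row.toSeq.toList
      Row.fromSeq(cols :+ (cols(1).asInstanceOf[Int] + cols(2).asInstanceOf[Int]))
    }

    withSum.collect().foreach(println)
    // expected output, roughly:
    // [123,523,534,893,1057]
    // [536,98,1623,98472,1721]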
