Multiple-column operations in Spark


Question

Using Python's Pandas, one can do bulk operations on multiple columns in one pass like this:

# assuming we have a DataFrame with, among others, the following columns
cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
# divide every listed column by another_column row-wise
# (axis=0 aligns on the index; plain `/` would align on column labels)
df[cols] = df[cols].div(df['another_column'], axis=0)

Is there a similar functionality using Spark in Scala?

Currently I end up doing:

val df2 = df.withColumn("col1", $"col1" / $"another_column")
            .withColumn("col2", $"col2" / $"another_column")
            .withColumn("col3", $"col3" / $"another_column")
            .withColumn("col4", $"col4" / $"another_column")
            .withColumn("col5", $"col5" / $"another_column")
            .withColumn("col6", $"col6" / $"another_column")
            .withColumn("col7", $"col7" / $"another_column")
            .withColumn("col8", $"col8" / $"another_column")

Answer

You can use foldLeft to process the list of columns, as below:

// assumes a SparkSession named `spark` is in scope (as in spark-shell);
// the import provides the toDF implicit used below
import spark.implicits._

val df = Seq(
  (1, 20, 30, 4),
  (2, 30, 40, 5),
  (3, 10, 30, 2)
).toDF("id", "col1", "col2", "another_column")

val cols = Array("col1", "col2")

// fold over the column names, replacing each listed column with its
// value divided by another_column
val df2 = cols.foldLeft(df)((acc, c) =>
  acc.withColumn(c, df(c) / df("another_column"))
)

df2.show
+---+----+----+--------------+
| id|col1|col2|another_column|
+---+----+----+--------------+
|  1| 5.0| 7.5|             4|
|  2| 6.0| 8.0|             5|
|  3| 5.0|15.0|             2|
+---+----+----+--------------+
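
An alternative, sketched below against the same df and cols from the example above, is to rebuild all the columns in a single select instead of folding withColumn calls; it preserves the column order and expresses the whole transformation as one projection:

// a minimal sketch, assuming the df and cols defined above:
// keep every column, but replace the ones listed in cols with
// the divided version, aliased back to its original name
val df3 = df.select(df.columns.map { c =>
  if (cols.contains(c)) (df(c) / df("another_column")).as(c) else df(c)
}: _*)

df3.show should print the same table as df2.show above.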
