How to compute the numerical difference between columns of different dataframes?

Problem description

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).

For instance, consider the following datasets:

DataFrame A:

+----+---+
|  A | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:

+----+---+
|  A | B |
+----+---+
|   1| 0 |
|   0| 0 |
+----+---+

How can I obtain B - A, i.e.


+----+---+
| c1 | c2|
+----+---+
|   0| 0 |
|  -1| 0 |
+----+---+

In practice, the real dataframes have a large number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?

Recommended answer

I was able to solve this with the approach below. The code works with any number of columns, as long as every column is an Int (each value is read with getAs[Int]). You just have to change the input DataFrames accordingly.

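import org.apache.spark.sql.Row
// toDF needs the session implicits (already in scope in spark-shell).
import spark.implicits._

val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")

val columns = df0.columns

// Pair rows by position, then compute df1 - df0 for every column.
// Note: RDD.zip requires both RDDs to have the same number of partitions
// and the same number of elements in each partition.
val rdd = df0.rdd.zip(df1.rdd).map { case (row0, row1) =>
  val diffs = columns.map(column =>
    row1.getAs[Int](column) - row0.getAs[Int](column))
  Row(diffs: _*)
}

// Reuse df0's schema, since every column is an Int here.
spark.createDataFrame(rdd, df0.schema).show(false)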

Generated output:

df0=>
+---+---+
|a  |b  |
+---+---+
|1  |5  |
|1  |4  |
+---+---+
df1=>
+---+---+
|a  |b  |
+---+---+
|1  |0  |
|3  |2  |
+---+---+
Output=>
+---+---+
|a  |b  |
+---+---+
|0  |-5 |
|2  |-2 |
+---+---+
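
The zip-based code above only works when both RDDs have identical partitioning. As a possible alternative, the sketch below gives each row an explicit position with zipWithIndex, joins the two DataFrames on that position, and subtracts with column expressions, so it is not tied to Int columns. The names DataFrameDiff, withRowIndex, diff and row_idx are hypothetical and introduced only for illustration. The sketch assumes both DataFrames share the same column names and that their row order is meaningful and deterministic.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object DataFrameDiff {
  // Hypothetical helper: append a 0-based row position as an extra Long column.
  def withRowIndex(df: DataFrame, indexCol: String): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(
      df.schema.fields :+ StructField(indexCol, LongType, nullable = false))
    df.sparkSession.createDataFrame(indexed, schema)
  }

  // Hypothetical helper: compute right - left for every column of `left`,
  // matching rows by position. Assumes all columns are numeric.
  def diff(left: DataFrame, right: DataFrame): DataFrame = {
    val idx = "row_idx" // assumed not to clash with an existing column name
    val l = withRowIndex(left, idx).alias("l")
    val r = withRowIndex(right, idx).alias("r")
    // Backticks keep column names containing dots from being misparsed.
    val diffs = left.columns.map(c => (col(s"r.`$c`") - col(s"l.`$c`")).alias(c))
    l.join(r, col(s"l.$idx") === col(s"r.$idx"))
      .orderBy(col(s"l.$idx")) // restore the original row order after the join
      .select(diffs: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-diff")
      .getOrCreate()
    import spark.implicits._

    val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
    val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")

    diff(df0, df1).show(false)
    spark.stop()
  }
}
```

For the sample inputs this should give the same differences as the output shown above. The join adds a shuffle, so it is likely slower than zip on large inputs. The trade-off is that it does not depend on how the two DataFrames happen to be partitioned.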
