How to compute the numerical difference between columns of different dataframes?


Problem description

Given two Spark DataFrames A and B with the same number of columns and rows, I want to compute the numerical difference between them and store it in another DataFrame (or, optionally, in another data structure).

For instance, consider the following datasets:

DataFrame A:

+----+---+
|  A | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:

+----+---+
|  A | B |
+----+---+
|   1| 0 |
|   0| 0 |
+----+---+

How can I obtain B - A, i.e.:


+----+---+
| c1 | c2|
+----+---+
|   0| 0 |
|  -1| 0 |
+----+---+

In practice, the real DataFrames have a large number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing this?

Solution

I was able to solve this with the approach below. The code works with any number of columns, as long as they are all of type Int; you only have to change the input DataFrames accordingly.

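import org.apache.spark.sql.Row
// Needed for toDF outside spark-shell; `spark` is the active SparkSession.
import spark.implicits._

val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")

val columns = df0.columns

// Pair up rows by position. RDD.zip requires both RDDs to have the same
// number of partitions and the same number of elements in each partition.
val rdd = df0.rdd.zip(df1.rdd).map { case (row0, row1) =>
  // Compute df1 - df0 for every column; all columns are assumed to be Int.
  val diffs = columns.map(c => row1.getAs[Int](c) - row0.getAs[Int](c))
  Row(diffs: _*)
}

spark.createDataFrame(rdd, df0.schema).show(false)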

Output generated:

df0=>
+---+---+
|a  |b  |
+---+---+
|1  |5  |
|1  |4  |
+---+---+
df1=>
+---+---+
|a  |b  |
+---+---+
|1  |0  |
|3  |2  |
+---+---+
Output=>
+---+---+
|a  |b  |
+---+---+
|0  |-5 |
|2  |-2 |
+---+---+
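
The answer above pairs rows with RDD.zip, which only works when both RDDs are partitioned identically. As a hedged alternative (not part of the original answer), the sketch below attaches an explicit row index with zipWithIndex and joins on it, so the subtraction is done with Column arithmetic and is not tied to Int. The object name ColumnDiff, the helpers withRowIndex and diff, and the index column name __row_idx are all hypothetical, and the sketch assumes that the row order of both inputs is meaningful and that no column name contains a dot.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ColumnDiff {
  // Hypothetical helper: appends a 0-based row index as an extra Long column.
  def withRowIndex(spark: SparkSession, df: DataFrame, name: String): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) }
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    spark.createDataFrame(indexed, schema)
  }

  // Hypothetical helper: returns b - a column by column, matching rows by position.
  def diff(spark: SparkSession, a: DataFrame, b: DataFrame): DataFrame = {
    val idx = "__row_idx" // assumed not to clash with an existing column name
    val left = withRowIndex(spark, a, idx).alias("a")
    val right = withRowIndex(spark, b, idx).alias("b")
    // One difference expression per column of a, keeping the original column name.
    val diffs = a.columns.map(c => (col(s"b.$c") - col(s"a.$c")).alias(c))
    left
      .join(right, col(s"a.$idx") === col(s"b.$idx"))
      .orderBy(col(s"a.$idx")) // restore the original row order after the join
      .select(diffs: _*)
  }
}
```

With the df0 and df1 from the answer, ColumnDiff.diff(spark, df0, df1).show(false) should print the same two rows as the output above. The join introduces a shuffle, so for inputs that are already partitioned identically the zip-based version may well be cheaper.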
