How to compare two data frames in Scala


Question

I have two structurally identical dataframes for a comparison test:

     df1
     ------------------------------------------
     year | state | count2 | count3 | count4|
     2014 | NJ    | 12332  | 54322  | 53422 |
     2014 | NJ    | 12332  | 53255  | 55324 |
     2015 | CO    | 12332  | 53255  | 55324 |
     2015 | MD    | 14463  | 76543  | 66433 |
     2016 | CT    | 14463  | 76543  | 66433 |
     2016 | CT    | 55325  | 76543  | 66433 |
     ------------------------------------------
     df2
     ------------------------------------------
     year | state | count2 | count3 | count4|
     2014 | NJ    | 12332  | 54322  | 53422 |
     2014 | NJ    | 65333  | 65555  | 125   |
     2015 | CO    | 12332  | 53255  | 55324 |
     2015 | MD    | 533    | 75     | 64524 |
     2016 | CT    | 14463  | 76543  | 66433 |
     2016 | CT    | 55325  | 76543  | 66433 |
     ------------------------------------------

I want to compare these two dfs on count2 through count4; if the counts don't match, print out a message saying they mismatch. Here is my attempt:

     val cols = df1.columns.filter(_ != "year").toList
     def mapDiffs(name: String) =
       when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)
     val result = df1.as("l").join(df2.as("r"), "year")
       .select($"year" :: cols.map(mapDiffs): _*)

It then compares any rows that share the same state and numbers; it didn't do what I wanted:

     ------------------------------------------
     year | state | count2 | count3 | count4|
     2014 | NJ    | 12332  | 54322  | 53422 |
     2014 | NJ    | no     | no     | no    |
     2015 | CO    | 12332  | 53255  | 55324 |
     2015 | MD    | no     | no     | 64524 |
     2016 | CT    | 14463  | 76543  | 66433 |
     2016 | CT    | 55325  | 76543  | 66433 |
     ------------------------------------------

I want the result to come out as above. How do I achieve that?

Edit: in a different scenario, if I want to compare within only one df, one column against other columns, how do I do that? Like:

 ------------------------------------------
 year | state | count2 | count3 | count4|
 2014 | NJ    | 12332  | 54322  | 53422 |

I want to compare the count3 and count4 columns to count2. Obviously count3 and count4 do not match count2, so I want the result to be:

-----------------------------------------------
 year | state | count2 | count3    | count4   |
 2014 | NJ    | 12332  | mismatch  | mismatch |

Thanks!

Answer

The dataframe join on year won't work for your mapDiffs method. You need a row-identifying column in df1 and df2 for the join.
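To see why, here is a plain-Scala illustration (no Spark needed): year is not unique, so an equi-join on it pairs every left row with every right row of that year instead of matching rows one-to-one.

```scala
// Two "2014" rows on each side: the equi-join on year yields
// 2 x 2 = 4 pairs for 2014, mixing unrelated rows together.
val left  = Seq(("2014", "12332"), ("2014", "12332"), ("2015", "12332"))
val right = Seq(("2014", "12332"), ("2014", "65333"), ("2015", "12332"))

// an equi-join on the first field, expressed with collections
val joined = for {
  (ly, lc) <- left
  (ry, rc) <- right
  if ly == ry
} yield (ly, lc, rc)

// year 2014 contributes 4 joined rows, not 2
assert(joined.count(_._1 == "2014") == 4)
```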

import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $-string column syntax

val df1 = Seq(
  ("2014", "NJ", "12332", "54322", "53422"),
  ("2014", "NJ", "12332", "53255", "55324"),
  ("2015", "CO", "12332", "53255", "55324"),
  ("2015", "MD", "14463", "76543", "64524"),
  ("2016", "CT", "14463", "76543", "66433"),
  ("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")

val df2 = Seq(
  ("2014", "NJ", "12332", "54322", "53422"),
  ("2014", "NJ", "12332", "53255", "125"),
  ("2015", "CO", "12332", "53255", "55324"),
  ("2015", "MD", "533",   "75",    "64524"),
  ("2016", "CT", "14463", "76543", "66433"),
  ("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")

Skip this step if you already have a row-identifying column (say, rowId) in the dataframes for the join:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd1 = df1.rdd.zipWithIndex.map{
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1i = spark.createDataFrame( rdd1,
  StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)

val rdd2 = df2.rdd.zipWithIndex.map{
  case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2i = spark.createDataFrame( rdd2,
  StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
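The zipWithIndex step above simply numbers the rows by position; a plain-Scala sketch of the idea (Spark aside, the ids are just consecutive integers):

```scala
// Each row gains its position as an id, so rows of the two
// dataframes can later be matched one-to-one by id rather than
// by a non-unique key like year.
val rows = Seq(Seq("2014", "NJ"), Seq("2014", "NJ"), Seq("2015", "CO"))
val withId = rows.zipWithIndex.map { case (row, id) => row :+ id.toLong }
assert(withId(1) == Seq("2014", "NJ", 1L))
```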

Now, define mapDiffs and apply it to the selected columns after joining the dataframes by rowId:

def mapDiffs(name: String) =
  when($"l.$name" === $"r.$name", $"l.$name").otherwise("no").as(name)

val cols = df1i.columns.filter(_.startsWith("count")).toList

val result = df1i.as("l").join(df2i.as("r"), "rowId").
  select($"l.rowId" :: $"l.year" :: cols.map(mapDiffs): _*)

// +-----+----+------+------+------+
// |rowId|year|count2|count3|count4|
// +-----+----+------+------+------+
// |    0|2014| 12332| 54322| 53422|
// |    5|2016| 55325| 76543| 66433|
// |    1|2014| 12332| 53255|    no|
// |    3|2015|    no|    no| 64524|
// |    2|2015| 12332| 53255| 55324|
// |    4|2016| 14463| 76543| 66433|
// +-----+----+------+------+------+
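The per-cell rule that mapDiffs implements can be modeled in plain Scala, which also shows one way to surface only the mismatches (in Spark the analogue would be a filter on the "no" cells, e.g. `result.filter($"count3" === "no")`; that exact call is an illustration, not part of the original answer):

```scala
// mapDiffs' per-cell rule: keep the value when both sides agree,
// emit "no" otherwise.
def cellDiff(l: String, r: String): String = if (l == r) l else "no"

val l3 = Seq("54322", "53255", "76543")  // df1.count3 for a few rows
val r3 = Seq("54322", "65555", "76543")  // df2.count3 for the same rows
val diffs = l3.zip(r3).map { case (l, r) => cellDiff(l, r) }
assert(diffs == Seq("54322", "no", "76543"))

// keep only the mismatching positions
val mismatches = diffs.zipWithIndex.filter(_._1 == "no")
assert(mismatches == Seq(("no", 1)))
```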

Note that your sample data appeared to have more discrepancies between df1 and df2 than just the 3 no-spots in your sample result. I've modified the sample data above to make those 3 spots the only differences.
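As for the edit's single-dataframe scenario (count3 and count4 compared against count2 within the same row), the same when/otherwise pattern applies; in Spark that would be along the lines of `df1.select($"year", $"state", $"count2", when($"count2" === $"count3", $"count3").otherwise("mismatch").as("count3"), ...)`, which is a sketch rather than tested code. The row rule itself, modeled in plain Scala:

```scala
// Row rule for the edit's scenario: each of count3/count4 is kept
// when it equals count2 and replaced by "mismatch" otherwise.
def selfDiff(count2: String)(other: String): String =
  if (other == count2) other else "mismatch"

val row = ("2014", "NJ", "12332", "54322", "53422")  // year, state, count2..4
val out = (row._1, row._2, row._3,
           selfDiff(row._3)(row._4), selfDiff(row._3)(row._5))
assert(out == ("2014", "NJ", "12332", "mismatch", "mismatch"))
```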

