How to compare two DataFrames and print the columns that differ in Scala


Problem description

We have two DataFrames here.

The expected DataFrame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual DataFrame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two DataFrames is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function, df1.except(df2), but it returns the entire rows that differ. What we want is to see which columns differ within such a row (in this case, "romin" versus "romino" in "emp_name"). We have had a great deal of difficulty with this, and any help would be appreciated.
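For anyone who wants to reproduce the setup, here is a minimal sketch of the two DataFrames and the row-level except call. The object name `ExceptDemo`, the local SparkSession settings, and the column types (Int, String, Long) are assumptions for illustration; the question does not state them.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExceptDemo {
  private val columns =
    Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

  // Rows copied from the "expected" table above; the column types are assumed.
  def expected(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
      (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romin", 9848022331L, 45123, "SanRamon")
    ).toDF(columns: _*)
  }

  // Same rows, except that emp_name for emp_id 4 is "romino".
  def actual(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
      (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romino", 9848022331L, 45123, "SanRamon")
    ).toDF(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("except-demo")
      .getOrCreate()

    // except works on whole rows: it keeps rows of `actual` that have
    // no identical row in `expected`, without saying which column differs.
    actual(spark).except(expected(spark)).show()

    spark.stop()
  }
}
```

Because except compares complete rows, the result only tells us that the row for emp_id 4 changed, which is exactly the limitation described above.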

Recommended answer

From the scenario described in the question, it looks like the difference has to be found between columns rather than rows.

To do that, we need to apply a selective difference, which gives us the columns that have different values, along with those values.

To apply a selective difference, we can write code along these lines:

  1. First, we need the column names of the expected and actual DataFrames (taken here from df1's schema, since both share the same columns).

    val columns = df1.schema.fields.map(_.name)

  • Then we have to find the differences column by column.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  • Finally, we need to find out which columns contain different values.

    selectiveDifferences.foreach(diff => if (diff.count() > 0) diff.show())

  • We then get only the columns that contain different values (here with df1 as the actual DataFrame and df2 as the expected one), like this:

    +--------+
    |emp_name|
    +--------+
    |  romino|
    +--------+
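    Note that a column-wise except reports the differing values but not the row they belong to, and it can miss a change when the new value already appears in another row of the same column. As an alternative, not the approach above, here is a minimal sketch that joins the two DataFrames on a key and reports each differing cell. It assumes that emp_id uniquely identifies a row in both DataFrames and that both share the same schema; the names `ColumnDiff`, `differingCells`, and `column_name` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object ColumnDiff {
  // Returns one row per (key, column) whose value differs between the inputs.
  // Assumes `key` is unique in both DataFrames and the schemas match.
  def differingCells(expected: DataFrame, actual: DataFrame, key: String): DataFrame = {
    val valueCols = expected.columns.filter(_ != key)

    // Inner join on the key; rows present on only one side are not reported.
    val joined = expected.as("e").join(actual.as("a"), Seq(key))

    // One struct per column: its name plus both values, cast to string
    // so that all structs have the same type and fit into one array.
    val cells = valueCols.map { c =>
      struct(
        lit(c).as("column_name"),
        col(s"e.$c").cast("string").as("expected"),
        col(s"a.$c").cast("string").as("actual")
      )
    }

    joined
      .select(col(key), explode(array(cells: _*)).as("cell"))
      .select(col(key), col("cell.column_name"), col("cell.expected"), col("cell.actual"))
      // <=> is null-safe equality, so two nulls count as equal.
      .where(!(col("expected") <=> col("actual")))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-diff")
      .getOrCreate()
    import spark.implicits._

    val names = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

    // Two of the rows from the question; the column types are assumed.
    val expected = Seq(
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romin", 9848022331L, 45123, "SanRamon")
    ).toDF(names: _*)

    val actual = Seq(
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romino", 9848022331L, 45123, "SanRamon")
    ).toDF(names: _*)

    differingCells(expected, actual, "emp_id").show()

    spark.stop()
  }
}
```

    With the sample data, the only reported cell is emp_name for emp_id 4, with "romin" as the expected value and "romino" as the actual one. Casting to string keeps the sketch short; comparing columns in their native types would need one comparison per column instead of a single array.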
    

    I hope this helps!
