How to compare two DataFrames and print the columns that differ in Scala


Question

Here are two DataFrames.

Expected DataFrame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

Actual DataFrame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

The difference between the two DataFrames is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function, df1.except(df2), but it returns the entire rows that differ. What we want is to see which columns differ within such a row (in this case, "romin" and "romino" in "emp_name"). We have had a lot of difficulty with this, and any help would be great.

Recommended answer

From the scenario described in the question, it looks like the difference has to be found between columns rather than between rows.

To do that, we apply a selective difference: an except taken one column at a time, which gives us the columns that hold different values, along with those values.

To apply the selective difference, we can write code along these lines:

  1. First, we need the column names of the expected and actual DataFrames (both share the same schema, so df1's names are enough).

    val columns = df1.schema.fields.map(_.name)

  • Then we have to find the difference column by column.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  • Finally, we need to find out which columns contain different values.

    // foreach rather than map: we only want the side effect of printing
    selectiveDifferences.foreach(diff => if (diff.count > 0) diff.show())

  • This prints only the columns that contain different values, like this:

    +--------+
    |emp_name|
    +--------+
    |  romino|
    +--------+
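
One limitation of the column-wise except above is that it compares the sets of distinct values in each column, so it does not say which row changed, and it would not notice values that were merely swapped between rows. If the rows can be matched on a key, a join reports the row, the column, and both values. The following is only a minimal sketch under assumptions that are not stated in the question: emp_id is taken to be unique in both DataFrames, there is at least one non-key column, and the names ColumnDiff, columnDiffs, column_name, expected and actual are hypothetical ones chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ColumnDiff {
  // Hypothetical helper: one output row per (key, column) whose value differs.
  // Assumes `key` is unique in both frames and both frames share a schema.
  def columnDiffs(expected: DataFrame, actual: DataFrame, key: String): DataFrame = {
    // Inner join: only keys present in both frames are compared.
    val joined = expected.as("e")
      .join(actual.as("a"), col(s"e.$key") === col(s"a.$key"))

    expected.columns.filter(_ != key).map { c =>
      joined
        // <=> is null-safe equality, so null vs null counts as equal.
        .filter(!(col(s"e.$c") <=> col(s"a.$c")))
        .select(
          col(s"e.$key").as(key),
          lit(c).as("column_name"),
          // Cast to string so columns of different types can be unioned.
          col(s"e.$c").cast("string").as("expected"),
          col(s"a.$c").cast("string").as("actual"))
    }.reduce(_ union _) // fails on an empty array, hence the non-key column assumption
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("column-diff").getOrCreate()
    import spark.implicits._

    val names = Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")
    val expected = Seq(
      (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
      (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romin", 9848022331L, 45123, "SanRamon")
    ).toDF(names: _*)
    val actual = Seq(
      (3, "Chennai", "rahman", 9848022330L, 45000, "SanRamon"),
      (1, "Hyderabad", "ram", 9848022338L, 50000, "SF"),
      (2, "Hyderabad", "robin", 9848022339L, 40000, "LA"),
      (4, "sanjose", "romino", 9848022331L, 45123, "SanRamon")
    ).toDF(names: _*)

    columnDiffs(expected, actual, "emp_id").show()
    spark.stop()
  }
}
```

For the sample data in the question this should yield a single row: emp_id 4, column_name emp_name, expected romin, actual romino. Rows whose key exists in only one of the DataFrames are dropped by the inner join; a full outer join would be needed to report those as well.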
    

    I hope this helps!
