How to check differences in rows belonging to two DataFrames
Problem description
I have two data frames that represent two different periods in time for the same people. For each row, I'd like to know whether there have been any changes in the five (fixed) columns of the two data frames.
Before:
+--+------+------+------+------+------+------+
|id| sport|  var1|  var2|  var3|  var4|  var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234|      |      |      |      |
| 2|soccer|  null|  null|  null|  null|  null|
| 3|soccer|330101|      |      |      |      |
| 4|soccer|  null|  null|  null|  null|  null|
| 5|soccer|  null|  null|  null|  null|  null|
| 6|soccer|  null|  null|  null|  null|  null|
| 7|soccer|  null|  null|  null|  null|  null|
| 8|soccer|330024|330401|      |      |      |
| 9|soccer|330055|330106|      |      |      |
|10|soccer|  null|  null|  null|  null|  null|
|11|soccer|390027|      |      |      |      |
|12|soccer|  null|  null|  null|  null|  null|
|13|soccer|330101|      |      |      |      |
|14|soccer|330059|      |      |      |      |
|15|soccer|  null|  null|  null|  null|  null|
|16|soccer|140242|140281|      |      |      |
|17|soccer|330214|      |      |      |      |
|18|soccer|      |      |      |      |      |
|19|soccer|330055|330196|      |      |      |
|20|soccer|210022|      |      |      |      |
+--+------+------+------+------+------+------+
After:
+--+------+------+------+------+------+------+
|id| sport|  var1|  var2|  var3|  var4|  var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234|      |      |      |      |
| 2|soccer|  null|  null|  null|  null|  null|
| 3|soccer|330101|      |      |      |      |
| 4|soccer|  null|  null|  null|  null|  null|
| 5|soccer|  null|  null|  null|  null|  null|
| 6|soccer|  null|  null|  null|  null|  null|
| 7|soccer|  null|  null|  null|  null|  null|
| 8|soccer|  null|  null|  null|  null|  null|
| 9|soccer|330106|      |      |      |      |
|10|soccer|  null|  null|  null|  null|  null|
|11|soccer|390027|      |      |      |      |
|12|soccer|  null|  null|  null|  null|  null|
|13|soccer|  null|  null|  null|  null|  null|
|14|soccer|330128|330331|330106|330059|      |
|15|soccer|  null|  null|  null|  null|  null|
|16|soccer|140242|140281|140010|      |      |
|17|soccer|330214|      |      |      |      |
|18|soccer|  null|  null|  null|  null|  null|
|19|soccer|330196|      |      |      |      |
|20|soccer|210022|      |      |      |      |
+--+------+------+------+------+------+------+
I know how to scan for differences between columns within a single row, but I am pretty clueless about how to compare rows of two different data frames.
An ideal output would be:
+--+------+------+
|id| sport|  diff|
+--+------+------+
| 1|soccer|     0|
| 2|soccer|     0|
| 3|soccer|     0|
| 4|soccer|     0|
| 5|soccer|     0|
| 6|soccer|     0|
| 7|soccer|     0|
| 8|soccer|     1|
| 9|soccer|     1|
|10|soccer|     0|
|11|soccer|     0|
|12|soccer|     0|
|13|soccer|     1|
|14|soccer|     1|
|15|soccer|     0|
|16|soccer|     1|
|17|soccer|     0|
|18|soccer|     0|
|19|soccer|     1|
|20|soccer|     0|
+--+------+------+
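Do you mean something like this? Let's start with example data:
// Outside spark-shell, `spark` is your SparkSession and these imports are needed:
import org.apache.spark.sql.functions.{col, lit, not}
import spark.implicits._ // toDF and the $"..." column syntax
val before = Seq(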
(1, "soccer", Some(1), Some(2), Some(3), Some(4), None),
(2, "soccer", None, Some(0), None, None, Some(0)),
(3, "soccer", None, None, None, None, None)
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
val after = Seq(
(1, "soccer", Some(1), Some(2), Some(3), Some(4), None), // Zero diffs
(2, "soccer", Some(1), Some(0), None, None, Some(0)), // One diff
(3, "soccer", Some(1), Some(1), Some(1), Some(1), Some(1)) // Five diffs
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
Generate an expression which counts differences:
// Extract var columns
val varCols = before.columns.drop(2)
// Generate a list of exprs
// CAST(NOT(before.var1 <=> after.var1) AS INT)
val equalsExprs = varCols.map(
c => not(col(s"before.$c") <=> col(s"after.$c")).cast("int").alias(s"${c}_ne"))
// Sum the per-column 0/1 flags into a single per-row count
val diff = equalsExprs.foldLeft(lit(0))(_ + _).alias("diff")
This expression treats:
- two NULLs as equal
- any value and NULL as not equal
- two non-NULL values according to standard type equality
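The three rules above can be modelled in plain Scala, without Spark, to show why the null-safe operator `<=>` is used rather than ordinary equality. This is only an illustrative sketch: `Option[Int]` stands in for a nullable column with `None` playing the role of NULL, and the names `sqlEq`, `nullSafeEq`, and `diffFlag` are hypothetical helpers, not part of any Spark API.

```scala
object NullSafeEquality {
  // Ordinary SQL `=`: the result is itself NULL (None here) if either side is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe `<=>`: always true or false, never NULL.
  // Option equality has exactly the three behaviours listed above.
  def nullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

  // Mirrors CAST(NOT(before <=> after) AS INT): 1 if the pair differs, else 0.
  def diffFlag(a: Option[Int], b: Option[Int]): Int =
    if (nullSafeEq(a, b)) 0 else 1

  def main(args: Array[String]): Unit = {
    // The var1..var5 values of the row with id 2 in the example data
    val before: Seq[Option[Int]] = Seq(None, Some(0), None, None, Some(0))
    val after: Seq[Option[Int]] = Seq(Some(1), Some(0), None, None, Some(0))

    println(sqlEq(None, None))      // None: plain equality gives no answer for NULLs
    println(nullSafeEq(None, None)) // true
    println(before.zip(after).map { case (a, b) => diffFlag(a, b) }.sum) // 1
  }
}
```

With ordinary equality, a single NULL comparison would produce a NULL flag, and adding NULL to the running total would make the whole count NULL, which is why every per-column flag needs to be strictly 0 or 1.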
Join and select the expression:
val diffs = before.as("before").join(after.as("after"), Seq("id", "sport"))
.select($"id", $"sport", diff)
diffs.show
// +---+------+----+
// | id| sport|diff|
// +---+------+----+
// |  1|soccer|   0|
// |  2|soccer|   1|
// |  3|soccer|   5|
// +---+------+----+
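Note that the result above is a count of changed columns, while the ideal output in the question is a 0/1 flag, and the question also shows id 18 (blank strings before, NULLs after) as unchanged. The following is a rough sketch of one way to get closer to that output. It assumes the var columns are strings, that blank and NULL should be treated as the same thing, and that only two var columns are needed for illustration; the object name `ChangedFlag` and the helpers `norm`, `changedFlags`, and `demo` are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, not, trim, when}

object ChangedFlag {
  // Assumption: blank strings count as missing. `when` without `otherwise`
  // yields NULL when the condition is false or NULL, so "" and NULL both become NULL.
  private def norm(name: String): Column =
    when(trim(col(name)) =!= "", col(name))

  def changedFlags(before: DataFrame, after: DataFrame): DataFrame = {
    val varCols = before.columns.drop(2) // everything after id and sport
    val diff = varCols
      .map(c => not(norm(s"before.$c") <=> norm(s"after.$c")).cast("int"))
      .foldLeft(lit(0))(_ + _)
    before.as("before")
      .join(after.as("after"), Seq("id", "sport"))
      // Collapse the count into a flag: 1 if any column changed, else 0
      .select(col("id"), col("sport"), (diff > 0).cast("int").alias("diff"))
  }

  // Three rows taken from the question, cut down to two var columns
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val before = Seq[(Int, String, String, String)](
      (1, "soccer", "330234", ""),
      (8, "soccer", "330024", "330401"),
      (18, "soccer", "", "")
    ).toDF("id", "sport", "var1", "var2")
    val after = Seq[(Int, String, String, String)](
      (1, "soccer", "330234", ""),
      (8, "soccer", null, null),
      (18, "soccer", null, null)
    ).toDF("id", "sport", "var1", "var2")
    changedFlags(before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("changed-flag").getOrCreate()
    demo(spark).orderBy("id").show() // diff should be 0, 1, 0 for ids 1, 8, 18
    spark.stop()
  }
}
```

Whether blank strings should really be equated with NULL depends on the data; if they should not, drop `norm` and compare the columns directly as in the answer.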