How to compare multiple rows?
Question
I'd like to compare two consecutive rows, i with i-1, of col2 (sorted by col1).
If item_i of the i-th row and item_[i-1] of the (i-1)-th row are different, I'd like to increment the count of item_[i-1] by 1.
+--------------+
| col1 col2 |
+--------------+
| row_1 item_1 |
| row_2 item_1 |
| row_3 item_2 |
| row_4 item_1 |
| row_5 item_2 |
| row_6 item_1 |
+--------------+
In the above example, if we scan two rows at a time downwards, we see that row_2 and row_3 are different, therefore we add one to item_1. Next, we see that row_3 is different from row_4, so we add one to item_2. Continue until we end up with:
+-------------+
| col2 col3 |
+-------------+
| item_1 2 |
| item_2 2 |
+-------------+
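The counting described above can be sketched in plain Scala, without Spark, just to make the expected result concrete (the column values are hard-coded from the example table):

```scala
// The col2 values from the example, in col1 order.
val col2 = Seq("item_1", "item_1", "item_2", "item_1", "item_2", "item_1")

// Slide a window of two rows down the column; whenever the pair differs,
// credit the earlier row's item with one count.
val counts = col2
  .sliding(2)
  .collect { case Seq(a, b) if a != b => a }
  .toSeq
  .groupBy(identity)
  .map { case (item, hits) => item -> hits.size }

println(counts)  // counts item_1 -> 2 and item_2 -> 2, matching the table above
```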
Answer
You can use a combination of a window function and an aggregate to do this. The window function is used to get the next value of col2 (using col1 for ordering). The aggregate then counts the times we encounter a difference. This is implemented in the code below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and $-notation (preloaded in spark-shell)

val data = Seq(
  ("row_1", "item_1"),
  ("row_2", "item_1"),
  ("row_3", "item_2"),
  ("row_4", "item_1"),
  ("row_5", "item_2"),
  ("row_6", "item_1")).toDF("col1", "col2")

val q = data.
  // col2_next holds col2 of the next row (ordered by col1); the last row
  // has no next value, so coalesce falls back to the row's own col2.
  withColumn("col2_next",
    coalesce(lead($"col2", 1) over Window.orderBy($"col1"), $"col2")).
  groupBy($"col2").
  // Count the rows whose item differs from the following row's item.
  agg(sum($"col2" =!= $"col2_next" cast "int") as "col3")
scala> q.show
17/08/22 10:15:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+----+
| col2|col3|
+------+----+
|item_1| 2|
|item_2| 2|
+------+----+
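The WindowExec warning is expected here: the window has no partitioning, so Spark must sort the whole dataset on a single partition to establish a global row order. For completeness, the same logic can also be written in Spark SQL; this is a sketch assuming a SparkSession named spark and the data DataFrame from above (the view name and subquery alias are arbitrary):

```scala
data.createOrReplaceTempView("data")

val qSql = spark.sql("""
  SELECT col2,
         SUM(CAST(col2 != col2_next AS INT)) AS col3
  FROM (
    SELECT col2,
           COALESCE(LEAD(col2) OVER (ORDER BY col1), col2) AS col2_next
    FROM data
  ) t
  GROUP BY col2
""")
```

The subquery mirrors the withColumn step: the window function must be evaluated first, because a window expression cannot appear directly inside the SUM aggregate.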