Count number of duplicate rows in SparkSQL


Question

I have a requirement where I need to count the number of duplicate rows in SparkSQL for Hive tables.

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row
app_name="test"
conf = SparkConf().setAppName(app_name)
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df = sqlContext.sql("select * from  DV_BDFRAWZPH_NOGBD_R000_SG.employee")

As of now I have hardcoded the table name, but it actually comes in as a parameter. That being said, we don't know the number of columns or their names either. In pandas we have something like df.duplicated().sum() to count the number of duplicate records. Do we have something like this here?

+---+---+---+
| 1 | A | B |
| 1 | A | B |
| 2 | B | E |
| 2 | B | E |
| 3 | D | G |
| 4 | D | G |
+---+---+---+

Here the number of duplicate rows is 4 (for example).

Answer

You essentially want to groupBy() all the columns and count(), then select the sum of the counts for the rows where the count is greater than 1.

import pyspark.sql.functions as f
df.groupBy(df.columns)\
    .count()\
    .where(f.col('count') > 1)\
    .select(f.sum('count'))\
    .show()
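The aggregation logic can be checked without a Spark cluster: grouping by all columns and counting is the same as counting whole-row tuples, then summing the counts of every group that occurs more than once. A minimal plain-Python sketch using the sample data from the question:

```python
from collections import Counter

# The question's sample rows, as plain tuples (one tuple == one row).
rows = [
    (1, "A", "B"),
    (1, "A", "B"),
    (2, "B", "E"),
    (2, "B", "E"),
    (3, "D", "G"),
    (4, "D", "G"),
]

# Counter over whole-row tuples mirrors groupBy(all columns).count();
# summing the counts of groups with count > 1 mirrors the where() + sum().
counts = Counter(rows)
duplicate_rows = sum(c for c in counts.values() if c > 1)
print(duplicate_rows)  # 4
```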

Explanation

After the grouping and aggregation, your data will look like this (the last column is the count):

+---+---+---+---+
| 1 | A | B | 2 |
| 2 | B | E | 2 |
| 3 | D | G | 1 |
| 4 | D | G | 1 |
+---+---+---+---+

Then use where() to filter only the rows with a count greater than 1, and select the sum. In this case, you will get the first 2 rows, which sum to 4.
