Apache Spark: how to do a distinct count and count together?


Question

I would like to compute a count and a distinct count at the same time. Let me explain with a simple example. Here is my input data:

arbre;pommier
fruit;pomme
fruit;pomme
fruit;fraise
fruit;ichigo
arbre;cerisier
arbre;abricotier
sport;foot
sport;rugby
arbre;abricotier
arbre;abricotier
arbre;bananier
fruit;ananas
sport;basket
arbre;abricotier
arbre;abricotier

As output, I want to get this:

//type;count;distinct-count
arbre;8;4
fruit;5;4
sport;3;3

With the word count example, I can compute the number of times a key appears (this is the count). And with distinct().groupByKey(), I manage to compute the number of distinct values per key (this is the distinct-count).

However, I can't figure out how to do both on the same RDD rather than on two separate RDDs.

If you have an idea, documentation, or a link on this subject, I would be very thankful.

Recommended answer

Data:

col1,col2
arbre,pommier
fruit,pomme
fruit,pomme
fruit,fraise
fruit,ichigo
arbre,cerisier
arbre,abricotier
sport,foot
sport,rugby
arbre,abricotier
arbre,abricotier
arbre,bananier
fruit,ananas
sport,basket
arbre,abricotier
arbre,abricotier

Using Spark 2:

// Read the CSV, treating the first line as a header and inferring column types
val df = sqlContext.read.option("header", "true").option("inferSchema", "true").csv("filelocation")
df.show

import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Compute the count and the distinct count per group in a single aggregation
df.groupBy("col1")
  .agg(count("col2").alias("count"), countDistinct("col2").alias("distinct"))
  .show

Output:

+-----+-----+--------+
| col1|count|distinct|
+-----+-----+--------+
|arbre|    8|       4|
|sport|    3|       3|
|fruit|    5|       4|
+-----+-----+--------+
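
Since the question asks about doing this on a single RDD, here is a rough sketch of one possible way with aggregateByKey, keeping a running count and a set of the values seen so far for each key. It is not part of the original answer. The names lines, pairs and stats are made up for illustration, and it assumes Spark 2.x running in local mode. The set for each key is held in memory, so this only suits keys with a moderate number of distinct values.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` and `sc` already exist
val spark = SparkSession.builder().master("local[*]").appName("count-and-distinct").getOrCreate()
val sc = spark.sparkContext

// The input lines from the question, in their original "type;value" format
val lines = Seq(
  "arbre;pommier", "fruit;pomme", "fruit;pomme", "fruit;fraise",
  "fruit;ichigo", "arbre;cerisier", "arbre;abricotier", "sport;foot",
  "sport;rugby", "arbre;abricotier", "arbre;abricotier", "arbre;bananier",
  "fruit;ananas", "sport;basket", "arbre;abricotier", "arbre;abricotier"
)

// Split each line into a (type, value) pair
val pairs = sc.parallelize(lines).map { line =>
  val Array(kind, value) = line.split(";", 2)
  (kind, value)
}

// Per-key accumulator: (number of rows, set of distinct values seen)
val stats = pairs
  .aggregateByKey((0L, Set.empty[String]))(
    (acc, value) => (acc._1 + 1, acc._2 + value), // fold one value into a partition-local accumulator
    (a, b) => (a._1 + b._1, a._2 ++ b._2)         // merge accumulators from different partitions
  )
  .mapValues { case (n, seen) => (n, seen.size) }
  .collectAsMap()

// Print as type;count;distinct-count, sorted by type
stats.toSeq.sortBy(_._1).foreach { case (kind, (n, d)) => println(s"$kind;$n;$d") }
// arbre;8;4
// fruit;5;4
// sport;3;3
```

This makes one pass over the pairs and one shuffle. When the number of distinct values per key can get large, the DataFrame version above is probably the safer choice.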
