Apache Spark: how to do a distinct count and a count together?
Problem description
I would like to do a distinct count and a count at the same time. Let me explain with a simple example. Here is my input data:
arbre;pommier
fruit;pomme
fruit;pomme
fruit;fraise
fruit;ichigo
arbre;cerisier
arbre;abricotier
sport;foot
sport;rugby
arbre;abricotier
arbre;abricotier
arbre;bananier
fruit;ananas
sport;basket
arbre;abricotier
arbre;abricotier
As an output, I want to get this:
//type;count;distinct-count
arbre;8;4
fruit;5;4
sport;3;3
With the word count example, I can compute the number of times a word appears (so here is count). And with distinct().groupByKey(), I manage to compute the distinct count (here is distinct-count).
However, I can't figure out how to do both on the same RDD rather than on two separate RDDs.
If you have an idea, documentation, or a link on this subject, I would be very thankful.
Recommended answer
Data:
col1,col2
arbre,pommier
fruit,pomme
fruit,pomme
fruit,fraise
fruit,ichigo
arbre,cerisier
arbre,abricotier
sport,foot
sport,rugby
arbre,abricotier
arbre,abricotier
arbre,bananier
fruit,ananas
sport,basket
arbre,abricotier
arbre,abricotier
Using Spark 2:
// Spark 2: read the CSV (with a header row) into a DataFrame
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("filelocation")
df.show

import org.apache.spark.sql.functions._

// Count and distinct count in a single aggregation
df.groupBy("col1")
  .agg(count("col2").alias("count"), countDistinct("col2").alias("distinct"))
  .show
Output:
+-----+-----+--------+
| col1|count|distinct|
+-----+-----+--------+
|arbre| 8| 4|
|sport| 3| 3|
|fruit| 5| 4|
+-----+-----+--------+
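The answer above uses the DataFrame API, but the question asked specifically about doing this on a single RDD. A one-pass sketch with aggregateByKey, assuming a SparkContext named sc and the semicolon-separated input from the question (filelocation is a placeholder path):

```scala
// Parse each line "type;value" into a (type, value) pair
val pairs = sc.textFile("filelocation")
  .map(_.split(";"))
  .map(a => (a(0), a(1)))

// Accumulator per key: (running count, set of distinct values).
// aggregateByKey computes both in a single pass over the RDD.
val result = pairs.aggregateByKey((0L, Set.empty[String]))(
  // fold one value into a per-partition accumulator
  (acc, v) => (acc._1 + 1, acc._2 + v),
  // merge accumulators from different partitions
  (a, b)   => (a._1 + b._1, a._2 ++ b._2)
).mapValues { case (count, values) => (count, values.size) }

result.collect().foreach { case (k, (count, distinct)) =>
  println(s"$k;$count;$distinct")
}
```

One caveat with this sketch: the per-key Set holds every distinct value in memory, which is fine for small cardinalities like this example but can grow large otherwise; in that case the DataFrame aggregation above (or an approximate count-distinct) is the safer choice.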