Efficient Count Distinct with Apache Spark


Question

100 million customers click 100 billion times on the pages of a few websites (let's say 100 websites), and the click stream is available to you as one large dataset.

Using the abstractions of Apache Spark, what is the most efficient way to count distinct visitors per website?

Recommended answer

visitors.distinct().count() would be the obvious way. distinct accepts a number of partitions, so you can specify the level of parallelism and may see an improvement in speed. If it is possible to set up visitors as a stream and use DStreams, that would do the count in real time. You can stream directly from a directory and apply the same RDD methods to each batch, like:

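val file = ssc.textFileStream(...)
// DStream has no distinct() of its own, so apply it to each batch RDD via transform
file.transform(_.distinct()).count()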
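
For the batch (non-streaming) case, the exact count with an explicit level of parallelism might look like the sketch below. It assumes Spark is on the classpath and runs in local mode. The object name, the helper exactDistinct, and the sample visitor ids are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ExactDistinct {
  // Exact distinct count; numPartitions sets the parallelism of the shuffle
  // that distinct() performs.
  def exactDistinct(visitors: RDD[String], numPartitions: Int): Long =
    visitors.distinct(numPartitions).count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("exact-distinct").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: one visitor id per click.
      val visitors = sc.parallelize(Seq("u1", "u2", "u1", "u3", "u2"))
      println(exactDistinct(visitors, numPartitions = 4)) // prints 3
    } finally {
      sc.stop()
    }
  }
}
```

A suitable value for numPartitions depends on the cluster and the data size, so 4 here is only a placeholder.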

The last option is def countApproxDistinct(relativeSD: Double = 0.05): Long. It is labelled as experimental, but it would be significantly faster than an exact distinct count, especially with a higher relativeSD (relative standard deviation), at the cost of accuracy.
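
A minimal sketch of the approximate count, again assuming a local Spark setup. The synthetic ids (100,000 distinct users repeated ten times) and the object and helper names are assumptions chosen only to make the example runnable.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ApproxDistinct {
  // Approximate distinct count; a larger relativeSD trades accuracy for
  // less memory and work.
  def approxDistinct(visitors: RDD[String], relativeSD: Double): Long =
    visitors.countApproxDistinct(relativeSD)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("approx-distinct").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical data: 1,000,000 clicks from 100,000 distinct users.
      val visitors = sc.parallelize(1 to 1000000).map(i => s"user-${i % 100000}")
      val approx = approxDistinct(visitors, relativeSD = 0.05)
      println(s"approximate distinct visitors: $approx (true value: 100000)")
    } finally {
      sc.stop()
    }
  }
}
```

The result is an estimate, so it should land near 100000 rather than exactly on it.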

Edit: Since you want the count per website, you can just reduce on the website id. This can be done efficiently (with combiners) since counting is an aggregation whose partial results can be summed. If you have an RDD of (website name, user id) tuples, you can do visitors.distinct().countByKey() for exact counts or visitors.countApproxDistinctByKey() for approximate ones; once again, the approximate one is experimental. To use approx distinct by key you need a PairRDD (http://people.apache.org/~pwendell/spark-1.0.0-rc6-docs/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions).
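
The per-website counts can be sketched as follows, assuming an RDD of (website, user id) pairs and a Spark version where the pair-RDD methods are available implicitly. The exact variant here removes duplicate pairs with distinct() and then counts with countByKey(). The object name, helper names, and sample clicks are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctPerSite {
  // Exact: drop duplicate (site, user) pairs, then count pairs per site.
  // countByKey() returns the result to the driver, which is fine for ~100 sites.
  def exactPerSite(clicks: RDD[(String, String)]): Map[String, Long] =
    clicks.distinct().countByKey().toMap

  // Approximate: one cardinality estimate per site.
  def approxPerSite(clicks: RDD[(String, String)], relativeSD: Double): RDD[(String, Long)] =
    clicks.countApproxDistinctByKey(relativeSD)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-per-site").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical click stream: (website, user id).
      val clicks = sc.parallelize(Seq(
        ("a.com", "u1"), ("a.com", "u1"), ("a.com", "u2"), ("b.com", "u1")))
      println(exactPerSite(clicks)) // a.com -> 2, b.com -> 1
      approxPerSite(clicks, relativeSD = 0.05).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With only about 100 websites, collecting the exact per-site counts to the driver is cheap. The expensive part is the shuffle behind distinct(), which is what the approximate variant avoids.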

Interesting side note: if you are OK with approximations and want fast results, you might want to look into BlinkDB, made by the same people as Spark (the AMPLab).
