如何在HyperLogLog算法的工作吗? [英] How does the HyperLogLog algorithm work?

查看:406
本文介绍了如何在HyperLogLog算法的工作吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在学习在我的业余时间不同的算法近日,一个我碰到这似乎是很有趣的被称为HyperLogLog算法 - 这估计独特的项目有多少是在一个列表

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.

这是特别有趣的我,因为它把我带回我的MySQL天,当我看到基数值(我一直认为直到最近,它的计算未估计)。

This was particularly interesting to me because it brought me back to my MySQL days when I saw that "Cardinality" value (which I always assumed until recently that it was calculated not estimated).

所以我知道如何编写算法为O(n),将计算出独特的项目有多少是在数组中。我在Javascript写这个

So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in Javascript

function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}

但问题是,我的算法,而O(N),使用了大量的内存(存储值)。

我一直在阅读本文对如何计算重复在O(n)的时间,使用最少的内存列表。 http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory. http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

据介绍,通过散列和计数位什么的人能在一定范围内的概率估算(假设列表中均匀分布)的唯一项目的数量在列表中。

It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.

我读过的文章,但我似乎无法理解。有人可以给出一个更外行的解释?我知道什么是哈希值,但我不明白它们是如何在这个HyperLogLog算法中使用。

I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.

推荐答案

我们用 HyperLogLog 算法相当广泛在我们的基础设施。因此,我们最终要解释给每个人,从高管到开发商。为了让我们的生活变得更轻松,我们已经提出了一个<一个href="https://web.archive.org/web/20150323055945/http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/">blog帖子它,已包括写在 D3 来帮助说明它模拟。我希望它能帮助!

We use the HyperLogLog algorithm quite extensively in our infrastructure. As a result we end up trying to explain it to everyone from executives to developers. To make our lives a little easier, we have put up a blog post on it and have included a simulation written in D3 to help illustrate it. I hope it helps!

这篇关于如何在HyperLogLog算法的工作吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆