解释伯爵素描算法 [英] Explaining The Count Sketch Algorithm

查看:223
本文介绍了解释伯爵素描算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有人可以解释如何计数素描算法的工作?我仍然无法弄清楚如何哈希值的使用,例如。我有一个很难理解本文

Can someone explain how the Count Sketch Algorithm works? I still can't figure out how hashes are used, for example. I have a hard time understanding this paper.

推荐答案

这流算法实例下面的框架。

This streaming algorithm instantiates the following framework.

  1. 查找随机数据流的算法,它的输出(作为随机变量)具有所需的期望,但通常是高方差(即噪声)。

  1. Find a randomized streaming algorithm whose output (as a random variable) has the desired expectation but usually high variance (i.e., noise).

要降低方差/噪音,并行运行多个独立的副本,并结合它们的输出。

To reduce the variance/noise, run many independent copies in parallel and combine their outputs.

通常为1比2。更有趣的这个算法的2其实是有点不标准,但我要谈的1只。

Usually 1 is more interesting than 2. This algorithm's 2 actually is somewhat nonstandard, but I'm going to talk about 1 only.

假设我们正在处理输入

a b c a b a .

通过三个计数器,就没有必要凑。

With three counters, there's no need to hash.

a: 3, b: 2, c: 1

然而,让我们假设我们只有一个。有八种可能的功能 H:{A,B,C} - > {+ 1,-1} 。下面是结果的一个表

Let's suppose however that we have just one. There are eight possible functions h : {a, b, c} -> {+1, -1}. Here is a table of the outcomes.

 h  |
abc |  X = counter
----+--------------
+++ | +3 +2 +1 =  6
++- | +3 +2 -1 =  4
+-- | +3 -2 -1 =  0
+-+ | +3 -2 +1 =  2
--+ | -3 -2 +1 = -4
--- | -3 -2 -1 = -6
-+- | -3 +2 -1 = -2
-++ | -3 +2 +1 =  0

现在,我们可以计算出期望

Now we can calculate expectations

            (6 + 4 + 0 + 2) - (-4 + -6 + -2 + 0)
E[h(a) X] = ------------------------------------ = 24/8 = 3
                             8

            (6 + 4 + -2 + 0) - (0 + 2 + -4 + -6)
E[h(b) X] = ------------------------------------ = 16/8 = 2
                             8

            (6 + 2 + -4 + 0) - (4 + 0 + -6 + -2)
E[h(c) X] = ------------------------------------ =  8/8 = 1 .
                             8

这是怎么回事呢?对于 A ,比方说,我们可以分解 X = Y + Z ,其中是总和的变化 A s和以Z 的总和非 - A 秒。通过预期的线性度,我们有

What's going on here? For a, say, we can decompose X = Y + Z, where Y is the change in the sum for the as, and Z is the sum for the non-as. By the linearity of expectation, we have

E[h(a) X] = E[h(a) Y] + E[h(a) Z] .

E [H(一)Y] 是对 A 即每次出现一个学期的总和 H(一)^ 2 = 1 ,所以 E [H(一)Y] 是出现的次数 A 。其他期限 E [H(一)Z] 为零;即使考虑 H(一),对方散列值也同样可能是加上或减去一个,所以有助于零的预期。

E[h(a) Y] is a sum with a term for each occurrence of a that is h(a)^2 = 1, so E[h(a) Y] is the number of occurrences of a. The other term E[h(a) Z] is zero; even given h(a), each other hash value is equally likely to be plus or minus one and so contributes zero in expectation.

在事实上,散列函数并不需要是均匀的随机的,和良好的事:就没有的方式来存储它。就足够了散列函数是成对独立的(任何两个特定散列值是独立的)。对于我们的简单的例子,一个随机选择的以下四个功能就足够了。

In fact, the hash function doesn't need to be uniform random, and good thing: there would be no way to store it. It suffices for the hash function to be pairwise independent (any two particular hash values are independent). For our simple example, a random choice of the following four functions suffices.

abc

+++
+--
-+-
--+

我会离开这个新的计算给你。

I'll leave the new calculations to you.

这篇关于解释伯爵素描算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆