什么是计数分钟草图?你会在什么时候使用它? [英] What is a count-min sketch? When would you use it?

查看:52
本文介绍了什么是计数分钟草图?你会在什么时候使用它?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

什么是,其中详细介绍了该主题.

如果您想同时支持增量和减量,会发生什么?您如何使用 count-min 草图来查找数据流中出现频率最高的元素?有关该主题的原始论文 是这里的一个很好的资源.

你能在没有随机性的情况下获得与 count-min 草图相同的结果吗?是的,使用一些巧妙的数论.

What is a count-min sketch? In what situations would it be useful?

解决方案

The Mile-High Overview

Intuitively, you can think of a count-min sketch as a space-efficient data structure for approximating how many times you've seen a given item in a data stream. From a client-side perspective, the count-min sketch supports two operations:

  • increment(x), which says "I've seen x another time," and
  • estimate(x), which says "please give me an estimate of how many times I've seen x."

If you just look at this interface, you might be thinking "isn't this something I could do with a hash table or a BST?" And you'd be right - you could just make a BST whose keys are the elements and whose values are the number of times each item is seen, or do the same with a hash table. If you have enough memory to keep track of all the keys you're encountering, this is not an unreasonable. But imagine that you're working on, say, an Amazon server and you're tracking views of each product. Suddenly, even writing down the names of all the pages that are visited is going to take a ton of memory. Or imagine you're Google and you want to find the most-commonly-visited pages each day. It would be prohibitively expensive to get enough RAM simply to write down all the search queries, and paging out to disk would be too slow.

What makes the count-min sketch shine is that the amount of memory needed by a count-min sketch can be tweaked by the user. If you give it more memory, it will produce better estimates of the true frequencies of the elements it's seen. If you give it less memory, it will produce lower-quality estimates, but with quantifiable guarantees about how likely it is for those estimates to be close to the true value. This means that if, say, you only have 1MB to dedicate to this purpose, you could get rough estimates of the frequencies, whereas if you have 128MB you could get dramatically better estimates.

The estimates given back by a count-min sketch will never underestimate the true frequencies of the elements. For example, if a count-min sketch says that an item has appeared 50 times, then the element might have appeared 50 times, or 49 times, or 48 times, etc., but it't not possible for it to have appeared 100 times. This makes the count-min sketch useful for finding high-frequency items: low-frequency items might have their frequencies overestimated a little bit, but high-frequency items will always appear popular.

The Internal Details

The count-min sketch is a fairly straightforward data structure to implement. The basic idea is the following. Imagine we have an array of counters, and we want to use that array to keep track of the frequencies of the items we're seeing. We'll use a hash function to assign each item to some counter (compute its hash code and mod it by the table size). Whenever we increment that item, we'll just go to the appropriate counter, then increment it. To provide an estimate, we'll just go to that counter and return the value stored there.

One way of thinking about how the count-min sketch works is to imagine each counter as a "bucket" holding all items of some type. We then estimate how frequent something is by seeing how many items are in its bucket, regardless of what those items are, as shown here:

As you can see, we get reasonably good estimates for frequent items, while infrequent items might have their frequencies grossly overestimated. This also gives a nice intuition as to why the count-min sketch never underestimates the frequencies of the elements. You'll always at least count the item itself when looking in its bucket.

If we make some reasonable assumptions about the hash functions we're using, we can formally prove that, on average, the estimate given back for an item's frequency is at most its actual frequency, plus the total number of items divided by the total number of counters. And that makes intuitive sense. If you have lots and lots of counters, at some point each item gets its own counter and the estimates will be perfectly accurate. On the other extreme, if you have almost no counters, then you'd expect all the items to be crammed into a small number of buckets, and the totals will start to be way off.

While this approach guarantees that on expectation the estimates returned will be good, that doesn't mean that you have a high probability of getting a good estimate. One way to think about this is to think about what it means to have a good estimate "on expectation." Intuitively, that would mean that if you were to build a bunch of these data structures, then the average of the estimates would probably be pretty good. However, any one individual estimate is not necessarily going to be all that good.

The count-min sketch takes this idea to heart and, instead of just having one array of counters, maintains several independent arrays of counters, each of which has a different hash function dropping items into buckets. This gives a degree of redundancy and means that it's highly unlikely that you'll get "unlucky" with items colliding in bad ways.

To get its overall estimate, the count-min sketch does something a bit more clever than just averaging the estimates together. Remember that the count-min sketch can never underestimate the actual frequencies of the items being stored, and can only overestimate them. This means that if you have a collection of different estimates coming back, the larger an estimate is, the worse it is. Why? Because the larger the estimate, the more items other than the one we care about it's counting. (Think back to buckets - if a bucket has a bunch of other items in it besides the one we care about, we don't want to use the size of that bucket as an estimate).

This is where the "min" part of "count-min sketch" comes in. When returning an estimate, the count-min sketch returns the minimum estimate among all the estimates generated. This is the estimate with the least noise in it. Intuitively, this is very likely to give you a good estimate - the only way it fails is if every single estimate was bad, which is fairly unlikely.

This means that the overall data structure, and the logic to manipulate it, is fairly straightforward:

Learning More

There's more to explore about the count-min sketch. For example, how do you formally analyze a count-min sketch to determine how many counters you need per row or how many independent structures you'll need? What kinds of hash functions can you use? To learn more about that, check out these lecture slides, which go into some detail on the topic.

What happens if you want to support both increments and decrements? How do you use the count-min sketch to find the most frequent elements in a data stream? The original paper on the topic is a good resource here.

Can you get the same results as a count-min sketch without randomness? Yes, using some clever number theory.

这篇关于什么是计数分钟草图?你会在什么时候使用它?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆