Data Structure Behind Amazon S3's Keys (Filtering Data Structure)


Question

I'd like to implement a data structure similar to the lookup functionality of Amazon S3. For those of you who don't know what I'm talking about, Amazon S3 stores all files at the root, but allows you to look up groups of files by common prefixes in their names, therefore replicating the power of a directory tree without the complexity of it.

The catch is, both lookup and filter operations are O(1) (or close enough that, even on very large buckets - S3's disk equivalents - both operations might as well be O(1)).

So in short, I'm looking for a data structure that functions like a hash map, with the added benefit of efficient (at the very least not O(n)) filtering. The best I can come up with is extending HashMap so that it also contains a (sorted) list of contents, and doing a binary search for the range that matches the prefix, and returning that set. This seems slow to me, but I can't think of any other way to do it.
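For illustration, here is a minimal sketch of that idea, assuming plain string keys; PrefixFilterMap and its method names are hypothetical, not taken from any existing library. Lookups stay O(1) via the backing HashMap, and a prefix filter costs two binary searches, O(log n); the downside is that every insert is O(n) because of the ArrayList shift.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrefixFilterMap<V> {

    private final Map<String, V> map = new HashMap<String, V>();
    private final List<String> sortedKeys = new ArrayList<String>();

    public void put(String key, V value) {
        if (!map.containsKey(key)) {
            // New key: binarySearch returns -(insertionPoint) - 1 for a miss.
            int i = Collections.binarySearch(sortedKeys, key);
            sortedKeys.add(-i - 1, key);
        }
        map.put(key, value);
    }

    public V get(String key) {
        return map.get(key); // O(1), as with any HashMap
    }

    // All keys starting with the given prefix, via two binary searches.
    // Simplification: assumes a non-empty prefix whose last character
    // is not Character.MAX_VALUE.
    public List<String> filter(String prefix) {
        // Successor of the prefix, e.g. "be4" -> "be5"; matching keys
        // fall in the half-open range [prefix, successor).
        String end = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        return sortedKeys.subList(lowerBound(prefix), lowerBound(end));
    }

    // Index of the first key >= target.
    private int lowerBound(String target) {
        int i = Collections.binarySearch(sortedKeys, target);
        return i >= 0 ? i : -i - 1;
    }
}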

Does anyone know either how Amazon does it, or a better way to implement this data structure?

Answer

Just to validate my claim that a regular TreeMap should suffice for any bucket with as many as 1,000,000 entries, here's a very simple test case that gives some numbers (attention: this is not meant as a microbenchmark, it's just to get a feeling for the magnitude of this problem).

I've used randomly generated UUIDs to mimic keys (if you replaced the dashes with slashes, you would even get a kind of directory structure). Afterwards, I've put them into a regular java.util.TreeMap and finally queried them with map.subMap(fromKey, toKey).

import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

public class TreeMapTest {

    public static void main(String[] args) {

        TreeMap<String, Object> map = new TreeMap<String, Object>();

        int count = 1000000;
        ArrayList<String> uuids;

        {
            System.out.print("generating ... ");
            long start = System.currentTimeMillis();
            uuids = new ArrayList<String>(count);
            for (int i = 0; i < count; i++) {
                uuids.add(UUID.randomUUID().toString());
            }
            System.out.println((System.currentTimeMillis() - start) + "ms");
        }

        {
            System.out.print("inserting .... ");
            long start = System.currentTimeMillis();

            // One shared dummy value; only the keys matter for this test.
            Object o = new Object();
            for (int i = 0; i < count; i++) {
                map.put(uuids.get(i), o);
            }

            System.out.println((System.currentTimeMillis() - start) + "ms");
        }

        {
            System.out.print("querying ..... ");

            // Bounds covering every key that starts with the prefix "be4".
            String from = "be400000-0000-0000-0000-000000000000";
            String to =   "be4fffff-ffff-ffff-ffff-ffffffffffff";

            long start = System.currentTimeMillis();

            long matches = 0;

            for (int i = 0; i < count; i++) {
                // subMap returns a live view of the range [from, to).
                Map<String, Object> result = map.subMap(from, to);
                matches += result.size();
            }

            System.out.println((System.currentTimeMillis() - start) + "ms (" + matches / count
                    + " matches)");
        }
    }
}

And here is some sample output from my machine (1,000,000 keys, 1,000,000 range queries):

generating ... 6562ms
inserting .... 2933ms
querying ..... 5344ms (229 matches)

Inserting one key took an average of 0.003 ms (though certainly more towards the end), while querying a sub-range with 229 matches took 0.005 ms per query. That's some pretty sane performance, isn't it?

After increasing the number to 10,000,000 keys and queries, the numbers are as follows:

generating ...  59562ms
inserting ....  47099ms
querying ..... 444119ms (2430 matches)

Inserting one key took an average of 0.005 ms, while querying a sub-range with 2,430 matches took 0.044 ms per query. Even though querying got 10 times slower (in the end it iterates through all matches, which is always O(n)), the performance still isn't too bad.

As S3 is a cloud service, I'd assume that it's pretty much limited by networking anyway. Hence there's no urgent need for an extremely fancy data structure to get the required performance. Still, some features are missing from my test case, most notably concurrency and persistence. Nevertheless, I think I've shown that a regular tree structure is sufficient for this use case. If you want to do something fancy, experiment with subtree read-write locking and maybe a replacement for .subMap(fromKey, toKey) - one possible shape for such a helper is sketched below.
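For completeness, here is a minimal sketch of such a replacement, assuming plain string keys; prefixSubMap is a hypothetical helper name, not an existing API. It derives both bounds from the prefix itself instead of hand-writing them the way the test case above does with "be400000-..." and "be4fffff-...":

import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixQuery {

    // Returns the live view of all entries whose keys start with the prefix.
    // Simplification: assumes a non-empty prefix whose last character is
    // not Character.MAX_VALUE.
    public static <V> SortedMap<String, V> prefixSubMap(
            NavigableMap<String, V> map, String prefix) {
        // Successor of the prefix, e.g. "be4" -> "be5": every key starting
        // with "be4" sorts into the half-open range ["be4", "be5").
        String end = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        return map.subMap(prefix, true, end, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, Object> map = new TreeMap<String, Object>();
        map.put("be4a0c2e-1111-2222-3333-444455556666", new Object());
        map.put("be500000-0000-0000-0000-000000000000", new Object());
        System.out.println(prefixSubMap(map, "be4").size()); // prints 1
    }
}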
