Multimaps in ChronicleMap


Question

There is definitely a disclaimer on ChronicleMap's GitHub about Multimaps in ChronicleMap:

Chronicle Map is not...

... No secondary indexes.

A multimap. Using a ChronicleMap<K, Collection<V>> as multimap is technically possible, but often leads to problems...

Unfortunately, this is one of my use cases and using off-heap storage for that (with ChronicleMap) would certainly be the easiest route to do so.

Let me try to explain my problem with pizzas. I have 100,000 different pizzas. Each pizza has an ID and lots of different toppings and crusts. I have three access patterns:

  • Give me the pizza by its ID.
  • Give me all the pizzas with a particular topping.
  • Give me all the pizzas with a particular crust.

I can store the pizzas easily using a ChronicleMap<UUID,Pizza>. But that's only one access pattern. I don't want to have to iterate through every pizza to find the ones with a matching topping or crust. So, I'd want to store something like ChronicleMap<Topping,Collection<UUID>> and ChronicleMap<Crust,Collection<UUID>>.

Then, if someone asks me for all the pizzas with pepperoni, I look up in the topping ChronicleMap to get the UUIDs of the matching pizzas, then in the main pizza map.
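The two-step lookup described above can be sketched as follows. This is a minimal illustration of the access pattern only: plain HashMaps stand in for the ChronicleMaps, and the Pizza and PizzaIndex types are hypothetical.

```java
import java.util.*;

// Hypothetical stand-in for the real pizza value type.
class Pizza {
    final UUID id;
    final Set<String> toppings;
    final String crust;
    Pizza(UUID id, Set<String> toppings, String crust) {
        this.id = id;
        this.toppings = toppings;
        this.crust = crust;
    }
}

class PizzaIndex {
    // In the real setup these would both be ChronicleMap instances;
    // plain maps are used here only to show the access pattern.
    final Map<UUID, Pizza> pizzasById = new HashMap<>();
    final Map<String, Collection<UUID>> pizzasByTopping = new HashMap<>();

    void add(Pizza pizza) {
        pizzasById.put(pizza.id, pizza);
        for (String topping : pizza.toppings) {
            pizzasByTopping.computeIfAbsent(topping, t -> new ArrayList<>())
                           .add(pizza.id);
        }
    }

    // Access pattern 2: look up UUIDs in the secondary (topping) map,
    // then resolve each UUID against the primary pizza map.
    List<Pizza> withTopping(String topping) {
        List<Pizza> result = new ArrayList<>();
        for (UUID id : pizzasByTopping.getOrDefault(topping, List.of())) {
            result.add(pizzasById.get(id));
        }
        return result;
    }
}
```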

But the documentation quoted above scares me. Does anyone know what these "problems" such a thing often leads to might be? Why should I not do this, even though it seems to be working for me? Does it have something to do with how ChronicleMap stores the serialized objects, specifically the collection?

A few additional notes for potential questions:

  1. We may add pizzas later, which will also require updating the collections.
  2. Many processes are trying to perform these operations, hence the need for a shared map via ChronicleMap rather than just a basic ConcurrentMap.

Answer

If the actual data indeed resembles pizzas, toppings and crusts, i.e. there are only a handful of distinct toppings/crusts, and thousands of pizzas contain each of them, I would say that having a proper multimap is overkill for this case and you would be better off with pepperoni_pizzas.dat, onions_pizzas.dat, ... distinct appendable shared lists with UUIDs; you can use Chronicle Queue to access and update them from multiple processes conveniently.
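The idea of distinct appendable lists can be sketched like this. It is a toy illustration using plain file appends; in the actual setup each list would be a Chronicle Queue, which additionally provides safe concurrent access from multiple processes. The class and file-naming scheme here are assumptions, not part of any Chronicle API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

// Hypothetical sketch: one append-only file per topping,
// e.g. pepperoni_pizzas.dat, onions_pizzas.dat, ...
class ToppingLists {
    final Path dir;

    ToppingLists(Path dir) { this.dir = dir; }

    static ToppingLists inTempDir() {
        try {
            return new ToppingLists(Files.createTempDirectory("toppings"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appending never rewrites existing data, unlike updating a
    // Collection<UUID> value in a map.
    void append(String topping, UUID pizzaId) {
        try {
            Files.writeString(dir.resolve(topping + "_pizzas.dat"),
                    pizzaId + "\n", StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read back all pizza UUIDs recorded for a topping, in append order.
    List<UUID> read(String topping) {
        Path file = dir.resolve(topping + "_pizzas.dat");
        if (!Files.exists(file)) return List.of();
        try {
            return Files.readAllLines(file).stream()
                    .map(UUID::fromString)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```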

If there are tens to hundreds of thousands of toppings/crusts, and only tens to hundreds of pizzas have a specific topping on average, you should indeed use a multimap.

Essentially there are 3 kinds of "problems" with Chronicle-Maps-as-multimaps:

Inefficient serialization

If you create a Chronicle Map with a List<UUID> or Set<UUID> value type without specifying custom value serializers, it will work, but it will be utterly inefficient, because it will default to built-in Java serialization for serializing and deserializing the whole value collection on each request, without reusing either collection heap objects or individual UUID heap objects for the elements. Hence a lot of garbage will be generated on each request to the Chronicle Map.

Solution: if you specify the value serializer as ListMarshaller or SetMarshaller (or your own custom collection marshaller, which you could write based on the ListMarshaller and SetMarshaller implementations) in conjunction with reusable UUID heap objects, it will resolve this garbage problem:

ListMarshaller<ReusableUuid> valueMarshaller = ListMarshaller.of(
    ReusableUuidReader.INSTANCE, ReusableUuidWriter.INSTANCE);
List<ReusableUuid> averageValue = Stream
    .generate(() -> ReusableUuid.random())
    .limit(averagePizzasForTopping)
    .collect(Collectors.toList());
ChronicleMap<Topping, List<ReusableUuid>> map = ChronicleMap
    .of(Topping.class, (Class<List<ReusableUuid>>) (Class) List.class)
    .averageKey(pepperoni)
    .valueMarshaller(valueMarshaller)
    .averageValue(averageValue)
    .entries(numberOfToppings)
    .createPersistedTo(new File("toppings_to_pizza_ids.dat"));

Inefficient value updates and replication

When you append another pizza UUID to a list of 100 UUIDs and insert the new value back into the Chronicle Map, Chronicle Map will re-write the whole list again, instead of appending one UUID to the end of the off-heap memory chunk. And if you use replication, it will send the whole list of 100 UUIDs as an updated value to other nodes, instead of sending just the one added UUID.
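The costly pattern shows up in what a caller has to do to append a single UUID. In this sketch a ConcurrentHashMap stands in for the ChronicleMap, but the shape of the update is the same: the whole list value is read and written back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class MultimapAppend {
    // Stand-in for ChronicleMap<Topping, List<UUID>>; "String" plays Topping.
    final Map<String, List<UUID>> map = new ConcurrentHashMap<>();

    void appendPizza(String topping, UUID pizzaId) {
        // 1. Read (with ChronicleMap: deserialize) the ENTIRE existing list.
        List<UUID> ids = new ArrayList<>(map.getOrDefault(topping, List.of()));
        // 2. Append one element.
        ids.add(pizzaId);
        // 3. Write (serialize) the ENTIRE list back -- with ChronicleMap this
        //    re-writes, and with replication re-sends, all N+1 UUIDs rather
        //    than just the one that was added.
        map.put(topping, ids);
    }
}
```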

Both (value updates and replication) could be optimized via terrible hacks, but that requires very deep knowledge of the Chronicle Map implementation and would be very fragile.

Memory fragmentation

If you plan to add new pizzas during the data store's lifetime, memory areas initially allocated for entries will become too small to hold new values with more UUIDs, so memory areas will be re-allocated (possibly several times for each list of UUIDs). Chronicle Map's data structure design implies a simplified memory allocation scheme, which suffers badly from fragmentation if entries are re-allocated many times.

If you have a lot of UUIDs in the lists, and you run your application on Linux, you can mitigate this problem by pre-allocating a lot of memory (more than will practically be needed by any list) for each entry (by specifying .actualChunkSize() in the ChronicleMapBuilder configuration) and relying on Linux's feature of lazy mapped memory allocation (page-by-page, as needed). Then you will lose at most 4KB of memory for each UUID list, which might be OK if the lists are many KBs in size.

On the other hand, if your lists are that long (and they are lists of UUIDs, i.e. small structures), and you have only 100,000 pizzas in total, you don't need a multimap in the first place; see the beginning of this answer.

The trick with memory overcommit and relying on lazy mapped memory allocation in Linux would also work for short lists (collections) of values, but only if the elements themselves are big, so that the average total value size is many KBs.

Fragmentation is also less of an issue when you can avoid entry memory re-allocation in some other way, i.e. new pizza UUIDs are added over time but removed as well, so the topping-to-UUIDs list sizes float around some average and re-allocation is rarely hit.

Memory fragmentation is never an issue if values are never updated (or never change in size) after the entry is inserted into the Chronicle Map.

In some use cases and with proper configuration, Chronicle Map could serve well as a multimap. In other cases, Chronicle Map as a multimap is inherently inefficient.

Important factors:

  • Total number of key -> List<Value> entries in the multimap
  • Total number of values
  • Average and distribution of key sizes
  • Average and distribution of distinct value sizes
  • Average and distribution of value list sizes
  • Value list dynamics over the Chronicle Map's lifetime (never updated, append-only, or remove and append; removes from the beginning and middle of lists are more expensive)
  • Whether the Chronicle Map is replicated or not
