Memory-constrained external sorting of strings, with duplicates combined & counted, on a critical server (billions of filenames)

Problem description

Our server produces files like {c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml in its log folder. The first part is a GUID; the second part is the name template.

I want to count the number of files with the same name template. For instance, we have

{c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml
{aa3718d1-98e2-4559-bab0-1c69f04eb7ec}-hero.xml
{0c7a50dc-972e-4062-a60c-062a51c7b32c}-sign.xml

The result should be

sign.xml,2
hero.xml,1

The total number of possible name templates is unknown and may exceed int.MaxValue.

The total number of files on the server is unknown and may exceed int.MaxValue.

Requirements:

The final result should be sorted by name template.

The server on which the tool is going to run is super critical. We need to be able to state the memory usage (in MB) and the number of temporary files generated, if any, before running the tool and without knowing any characteristics of the log folder.

We use C#.

My idea:

  • For the first 5000 files, count the occurrences and write the result to Group1.txt.
  • For the second 5000 files, count the occurrences and write the result to Group2.txt.
  • Repeat until all files are processed. Now we have a bunch of group files.

Then I merge all these group files.

    Group1.txt   Group2.txt   Group3.txt   Group4.txt
          \        /                \        /
         Group1-2.txt              Group3-4.txt
                   \                  /
                    Group1-4.txt

Group1-4.txt is the final result.

The disagreement between my friend and me is over how we count the occurrences.

I suggest using a dictionary, with the file name template as the key. Let m be the partition size (5000 in this example). Then the time complexity is O(m) and the space complexity is O(m).
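
To make the comparison concrete, here is a minimal C# sketch of that dictionary approach for one partition. The ExtractTemplate helper, the ordinal comparer and the "template,count" output format are illustrative assumptions, not part of the actual requirements.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class BatchCounter
    {
        // "{GUID}-sign.xml" -> "sign.xml"; assumes the template is everything after the last '-'.
        static string ExtractTemplate(string fileName)
        {
            int dash = fileName.LastIndexOf('-');
            return dash >= 0 ? fileName.Substring(dash + 1) : fileName;
        }

        // Count one partition of file names with a hash dictionary: O(m) time, O(m) space.
        public static Dictionary<string, long> CountPartition(IEnumerable<string> fileNames)
        {
            var counts = new Dictionary<string, long>(StringComparer.Ordinal);
            foreach (var name in fileNames)
            {
                string key = ExtractTemplate(name);
                counts.TryGetValue(key, out long c);
                counts[key] = c + 1;
            }
            return counts;
        }

        // Write one group file, e.g. "sign.xml,2", sorted so later merging is easy.
        public static void WriteGroup(Dictionary<string, long> counts, string path)
        {
            using (var w = new StreamWriter(path))
                foreach (var kv in counts.OrderBy(kv => kv.Key, StringComparer.Ordinal))
                    w.WriteLine($"{kv.Key},{kv.Value}");
        }
    }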

My friend suggests sorting the name templates and then counting the occurrences in one pass, since identical name templates are then adjacent. Time complexity O(m log m), space complexity O(m).
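
For comparison, a sketch of my friend's variant: sort the partition's templates, then count runs of equal keys in a single pass. It reuses the hypothetical ExtractTemplate helper and usings from the sketch above.

    // Sort one partition, then count runs of equal keys in a single pass: O(m log m).
    static IEnumerable<(string Template, long Count)> SortAndCount(List<string> templates)
    {
        templates.Sort(StringComparer.Ordinal);
        for (int i = 0; i < templates.Count; )
        {
            int j = i;
            while (j < templates.Count && templates[j] == templates[i]) j++;  // run of equal keys
            yield return (templates[i], j - i);
            i = j;
        }
    }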

We cannot persuade each other. Do you see any issues with these two methods?

Recommended answer

I don't know whether external sorting with count-merging of duplicates has been studied; I did find a 1983 paper (see below). Usually, sorting algorithms are designed and studied on the assumption that objects are sorted by keys, so duplicate keys have distinct objects. There might be some existing literature on this, but it's a very interesting problem. Probably it's just considered an application of compact dictionaries combined with external merge-sorting.

Efficient dictionaries for storing large numbers of strings in little memory are a well-studied problem. Most of the useful data structures can include auxiliary data for each word (in our case, a duplicate count).

TL;DR summary of useful ideas, since I rambled on in way too much detail about many things in the main body of this answer:

  • Set batch boundaries when your dictionary size hits a threshold, not after a fixed number of input files. If there were a lot of duplicates in a group of 5000 strings, you still won't be using very much memory. You can find far more duplicates in the first pass this way.

  • Sorted batches make merging much faster. You can and should merge many->one instead of doing binary merges. Use a PriorityQueue to figure out which input file has the line you should take next.

  • To avoid a burst of memory usage when sorting the keys in a hash table, use a dictionary that can do an in-order traversal of keys (i.e. sort on the fly). There's SortedDictionary<TKey, TValue> (binary-tree based). This also interleaves the CPU usage of sorting with the I/O waits while fetching the input strings.

  • Radix-sort each batch into outputs by first character (a-z, non-alphabetic that sorts before A, non-alphabetic that sorts after z), or some other bucketing choice that distributes your keys well. Use separate dictionaries for each radix bucket, and empty only the biggest into a batch when you hit your memory ceiling. (Fancier eviction heuristics than "biggest" may be worth it.)

  • Throttle I/O (especially when merging), and check the system CPU load and memory pressure. Adapt behaviour accordingly to make sure you don't cause an impact when the server is at its busiest.

  • For smaller temp files at the cost of CPU time, use a common-prefix encoding, or maybe lz4.

  • A space-efficient dictionary allows larger batch sizes (and thus a larger duplicate-finding window) for the same memory ceiling. A Trie (or better, a Radix Trie) might be ideal, because it stores the characters within the tree nodes, with common prefixes stored only once. A Directed Acyclic Word Graph is even more compact (it finds redundancy between common substrings that aren't prefixes). Using one as a dictionary is tricky but probably possible (see below).

  • Take advantage of the fact that you don't need to delete any tree nodes or strings until you're going to empty the whole dictionary. Use a growable array of nodes, and another growable char array that packs strings head to tail. (Useful for a Radix Trie with multi-char nodes, but not for a regular Trie where each node is a single char.)

Depending on how the duplicates are distributed, you might or might not be able to find very many on the first pass. This has some implications, but doesn't really change how you end up merging.

I'm assuming you have some directory traversal idea in mind, which can efficiently supply your code with a stream of strings to be uniquified and counted. So I'll just say "strings" or "keys", to talk about the inputs.

Trim off as many unnecessary characters as possible (e.g. lose the .xml if they're all .xml).

It might be useful to do the CPU/memory intensive work on a separate machine, depending on what other hardware you have with a fast network connection to your critical production server.

You could run a simple program on the server that sends filenames over a TCP connection to a program running on another machine, where it's safe to use much more memory. The program on the server could still do small dictionary batches, and just store them on a remote filesystem.

And now, since none of the other answers really put all the pieces together, here's my actual answer:

An upper bound on memory usage is easy. Write your program to use a constant memory ceiling, regardless of input size. Bigger inputs will lead to more merging phases, not more memory usage at any point.

The best estimate of temporary file storage space you can do without looking at the input is a very conservative upper bound that assumes every input string is unique. You need some way to estimate how many input strings there will be. (Most filesystems know how many separate files they contain, without having to walk the directory tree and count them.)

You can make some assumptions about the distribution of duplicates to make a better guess.

If number, rather than size, of scratch files is an issue, you can store multiple batches in the same output file, one after another. Either put length-headers at the start of each to allow skipping forward by batch, or write byte offsets to a separate data stream. If size is also important, see my paragraph about using frcode-style common-prefix compression.

As Ian Mercer points out in his answer, sorting your batches will make merging them much more efficient. If you don't, you either risk hitting a wall where your algorithm can't make forward progress, or you need to do something like load one batch, scan another batch for entries that are in the first, and rewrite the 2nd batch with just the potentially-few matching entries removed.

Not sorting your batches makes the time complexity of the first pass O(N), but either you have to sort at some point later, or your later stages have a worst-case bound that's dramatically worse. You want your output globally sorted, so other than RadixSort approaches, there's no avoiding an O(N log N) somewhere.

With limited batch size, O(log N) merge steps are expected, so your original analysis missed the O(N log N) complexity of your approach by ignoring what needs to happen after the phase1 batches are written.

The appropriate design choices change a lot depending on whether our memory ceiling is big enough to find many duplicates within one batch. If even a complex compact data structure like a Trie doesn't help much, putting the data into a Trie and getting it out again to write a batch is a waste of CPU time.

If you can't do much duplicate-elimination within each batch anyway, then you need to optimize for putting possibly-matching keys together for the next stage. Your first stage could group input strings by first byte, into up-to 252 or so output files (not all 256 values are legal filename characters), or into 27 or so output files (alphabet + misc), or 26+26+1 for upper/lower case + non-alphabetic. Temp files can omit the common prefix from each string.

Then most of these first stage batches should have a much higher duplicate density. Actually, this Radix distribution of inputs into output buckets is useful in any case, see below.

You should still sort your first-stage outputs in chunks, to give the next pass a much wider dup-find window for the same RAM.

I'm going to spend more time on the domain where you can find a useful amount of duplicates in the initial stream, before using up ~100MiB of RAM, or whatever you choose as an upper limit.

Obviously we add strings to some sort of dictionary to find and count duplicates on the fly, while only requiring enough storage for the set of unique strings. Just storing strings and then sorting them would be significantly less efficient, because we'd hit our RAM limit much sooner without on-the-fly duplicate detection.

To minimize the phase-2 work, phase 1 should find and count as many duplicates as possible, reducing the total size of the phase-2 data. Reducing the amount of merging work for phase 2 is good, too. Bigger batches help with both factors, so it's very useful to come as close to your memory ceiling as you safely can in phase 1. Instead of writing a batch after a constant number of input strings, do it when your memory consumption nears your chosen ceiling. Duplicates are counted and thrown away, and don't take any extra storage.

An alternative to accurate memory accounting is tracking the number of unique strings in your dictionary, which is easy (and done for you by the library implementation). Accumulating the lengths of the strings you add can give you a good estimate of the memory used for storing them, too. Or just make an assumption about the string-length distribution. Size your hash table correctly up front so it doesn't have to grow while you add elements, and stop when it's 60% full (load factor) or so.
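
As a rough illustration of that bookkeeping, the sketch below estimates dictionary memory from the entry count plus the accumulated key characters and signals a flush near a chosen ceiling. The 100 MiB budget and the 64-byte per-entry overhead are assumptions you would tune, not measured values.

    class MemoryBudget
    {
        const long CeilingBytes = 100L * 1024 * 1024;   // assumed upper bound for the whole dictionary
        const long OverheadPerEntry = 64;               // rough guess: tree node / hash bucket + count + refs

        long totalKeyChars;

        public void OnNewUniqueKey(string key) => totalKeyChars += key.Length;

        public bool ShouldFlush(int uniqueEntries)
        {
            long estimate = uniqueEntries * OverheadPerEntry
                          + totalKeyChars * sizeof(char);  // .NET strings are UTF-16
            return estimate >= (long)(CeilingBytes * 0.9); // keep some headroom below the ceiling
        }

        public void Reset() => totalKeyChars = 0;
    }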

A space-efficient data structure for the dictionary increases our dup-finding window for a given memory limit. Hash tables get badly inefficient when their load factor is too high, but the hash table itself only has to store pointers to the strings. It's the most familiar kind of dictionary and has library implementations.

We know we're going to want our batch sorted once we've seen enough unique keys, so it might make sense to use a dictionary that can be traversed in sorted order. Sorting on the fly makes sense because keys will come in slowly, limited by disk IO since we're reading from filesystem metadata. One downside is if most of the keys we see are duplicates, then we're doing a lot of O(log batchsize) lookups, rather than a lot of O(1) lookups. And it's more likely that a key will be a duplicate when the dictionary is big, so most of those O(log batchsize) queries will be with a batch size near max, not uniformly distributed between 0 and max. A tree pays the O(log n) overhead of sorting for every lookup, whether the key turned out to be unique or not. A hash table only pays the sorting cost at the end after removing duplicates. So for a tree it's O(total_keys * log unique_keys), hash table is O(unique_keys * log unique_keys) to sort a batch.

A hash table with max load factor set to 0.75 or something might be pretty dense, but having to sort the KeyValuePairs before writing out a batch probably puts a damper on using standard Dictionary. You don't need copies of the strings, but you'll probably end up copying all the pointers (refs) to scratch space for a non-in-place sort, and maybe also when getting them out of the hash table before sorting. (Or instead of just pointers, KeyValuePair, to avoid having to go back and look up each string in the hash table). If short spikes of big memory consumption are tolerable, and don't cause you to swap / page to disk, you could be fine. This is avoidable if you can do an in-place sort in the buffer used by the hash table, but I doubt that can happen with standard-library containers.

A constant trickle of CPU usage to maintain the sorted dictionary at the speed keys are available is probably better than infrequent bursts of CPU usage to sort all of a batch's keys, besides the burst of memory consumption.

The .NET standard library has SortedDictionary<TKey, TValue>, which the docs say is implemented with a binary tree. I didn't check if it has a rebalance function, or uses a red-black tree, to guarantee O(log n) worst case performance. I'm not sure how much memory overhead it would have. If this is a one-off task, then I'd absolutely recommend using this to implement it quickly and easily. And also for a first version of a more optimized design for repeated use. You'll probably find it's good enough, unless you can find a nice library implementation of Tries.
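
A minimal sketch of a phase-1 batch built on SortedDictionary: duplicates only bump a counter, and enumerating the dictionary yields keys in sorted order, so each temp file comes out pre-sorted for the merge phase. The "key,count" line format is the same assumption as in the earlier sketches.

    using System;
    using System.Collections.Generic;
    using System.IO;

    class SortedBatch
    {
        readonly SortedDictionary<string, long> counts =
            new SortedDictionary<string, long>(StringComparer.Ordinal);

        public int UniqueKeys => counts.Count;

        public void Add(string template)
        {
            counts.TryGetValue(template, out long c);
            counts[template] = c + 1;             // duplicates cost no extra memory
        }

        public void WriteAndClear(string path)
        {
            using (var w = new StreamWriter(path))
                foreach (var kv in counts)        // in-order traversal: already sorted
                    w.WriteLine($"{kv.Key},{kv.Value}");
            counts.Clear();
        }
    }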

The more memory-efficient our dictionary is, the more dups we can find before having to write out a batch and delete the dictionary. Also, if it's a sorted dictionary, our batches can be larger even when they don't find duplicates.

A secondary impact of data structure choice is how much memory traffic we generate while running on the critical server. A sorted array (with O(log n) lookup time (binary search), and O(n) insert time (shuffle elements to make room)) would be compact. However, it wouldn't just be slow, it would saturate memory bandwidth with memmove a lot of the time. 100% CPU usage doing this would have a bigger impact on the server's performance than 100% CPU usage searching a binary tree. It doesn't know where to load the next node from until it's loaded the current node, so it can't pipeline memory requests. The branch mispredicts of comparisons in the tree search also help moderate consumption of the memory bandwidth that's shared by all cores. (That's right, some 100%-CPU-usage programs are worse than others!)

It's nice if emptying our dictionary doesn't leave memory fragmented when we empty it. Tree nodes will be constant size, though, so a bunch of scattered holes will be usable for future tree node allocations. However, if we have separate dictionaries for multiple radix buckets (see below), key strings associated with other dictionaries might be mixed in with tree nodes. This could lead to malloc having a hard time reusing all the freed memory, potentially increasing actual OS-visible memory usage by some small factor. (Unless C# runtime garbage collection does compaction, in which case fragmentation is taken care of.)

Since you never need to delete nodes until you want to empty the dictionary and delete them all, you could store your Tree nodes in a growable array. So memory management only has to keep track of one big allocation, reducing bookkeeping overhead compared to malloc of each node separately. Instead of real pointers, the left / right child pointers could be array indices. This lets you use only 16 or 24 bits for them. (A Heap is another kind of binary tree stored in an array, but it can't be used efficiently as a dictionary. It's a tree, but not a search tree).

Storing the string keys for a dictionary would normally be done with each String as a separately-allocated object, with pointers to them in an array. Since again, you never need to delete, grow, or even modify one until you're ready to delete them all, you can pack them head to tail in a char array, with a terminating zero-byte at the end of each one. This again saves a lot of book-keeping, and also makes it easy to keep track of how much memory is in use for the key strings, letting you safely come closer to your chosen memory upper bound.
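
A hedged sketch of that append-only layout: nodes live in one growable list addressed by int indices instead of references, and all key text is packed head to tail in one growable char buffer. The Node fields here are illustrative placeholders, not a complete Radix Trie.

    using System.Collections.Generic;

    struct Node
    {
        public int TextOffset;    // where this node's characters start in the char pool
        public int TextLength;
        public int Left, Right;   // child indices into the node list; -1 = none
        public long Count;
    }

    class AppendOnlyStore
    {
        public readonly List<Node> Nodes = new List<Node>();
        readonly List<char> chars = new List<char>();

        // Append a key's characters and return an (offset, length) handle into the pool.
        public (int Offset, int Length) AddString(string s)
        {
            int offset = chars.Count;
            chars.AddRange(s);
            return (offset, s.Length);
        }

        public string GetString(int offset, int length) =>
            new string(chars.GetRange(offset, length).ToArray());

        // Emptying the whole dictionary is two cheap resets; no per-node or per-string frees.
        public void Clear() { Nodes.Clear(); chars.Clear(); }
    }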

For even denser storage of a set of strings, we can eliminate the redundancy of storing all the characters of every string, since there are probably a lot of common prefixes.

A Trie stores the strings in the tree structure, giving you common-prefix compression. It can be traversed in sorted order, so it sorts on the fly. Each node has as many children as there are different next-characters in the set, so it's not a binary tree. A C# Trie partial implementation (delete not written) can be found in this SO answer, to a question similar to this but not requiring batching / external sorting.

Trie nodes need to store potentially many child pointers, so each node can be large. Or each node could be variable-size, holding the list of nextchar:ref pairs inside the node, if C# makes that possible. Or, as the Wikipedia article says, a node can actually be a linked list or binary search tree, to avoid wasting space in nodes with few children. (The lower levels of a tree will have a lot of those.) End-of-word markers / nodes are needed to distinguish substrings that aren't separate dictionary entries from ones that are. Our count field can serve that purpose: count = 0 means the substring ending here isn't in the dictionary; count >= 1 means it is.
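
To illustrate the count-as-end-of-word-marker idea, here is a minimal counting Trie with one character per node and children kept in a SortedDictionary so a depth-first walk emits keys in sorted order. A Radix Trie or DAWG, as discussed, would be more compact; this sketch only shows the shape.

    using System;
    using System.Collections.Generic;

    class TrieNode
    {
        public readonly SortedDictionary<char, TrieNode> Children =
            new SortedDictionary<char, TrieNode>();
        public long Count;        // 0 = not a complete key; >= 1 = a key and its duplicate count
    }

    class CountingTrie
    {
        readonly TrieNode root = new TrieNode();

        public void Add(string key)
        {
            var node = root;
            foreach (char c in key)
            {
                if (!node.Children.TryGetValue(c, out var child))
                    node.Children[c] = child = new TrieNode();
                node = child;
            }
            node.Count++;         // the end-of-word marker doubles as the dup count
        }

        // Depth-first, children visited in character order => keys come out sorted.
        public void Emit(Action<string, long> sink) => Emit(root, "", sink);

        static void Emit(TrieNode node, string prefix, Action<string, long> sink)
        {
            if (node.Count > 0) sink(prefix, node.Count);
            foreach (var kv in node.Children)
                Emit(kv.Value, prefix + kv.Key, sink);
        }
    }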

A more compact Trie is the Radix Tree, or PATRICIA Tree, which stores multiple characters per node.

Another extension of this idea is the Deterministic acyclic finite state automaton (DAFSA), sometimes called a Directed Acyclic Word Graph (DAWG), but note that the DAWG wikipedia article is about a different thing with the same name. I'm not sure a DAWG can be traversed in sorted order to get all the keys out at the end, and as wikipedia points out, storing associated data (like a duplicate count) requires a modification. I'm also not sure they can be built incrementally, but I think you can do lookups without having compacted. The newly added entries will be stored like a Trie, until a compaction step every 128 new keys merges them into the DAWG. (Or run the compaction less frequently for bigger DAWGs, so you aren't doing it too much, like doubling the size of a hash table when it has to grow, instead of growing linearly, to amortize the expensive op.)

You can make a DAWG more compact by storing multiple characters in a single node when there isn't any branching / converging. This page also mentions a Huffman-coding approach to compact DAWGs, and has some other links and article citations.

JohnPaul Adamovsky's DAWG implementation (in C) looks good, and describes some optimizations it uses. I haven't looked carefully to see if it can map strings to counts. It's optimized to store all the nodes in an array.

This answer to the dup-count words in 1TB of text question suggests DAWGs, and has a couple links, but I'm not sure how useful it is.

You could get your RadixSort on, and keep separate dictionaries for every starting character (or for a-z, non-alphabetic that sorts before a, non-alphabetic that sorts after z). Each dictionary writes out to a different temp file. If you have multiple compute nodes available for a MapReduce approach, this would be the way to distribute merging work to the compute nodes.

This allows an interesting modification: instead of writing all radix buckets at once, only write the largest dictionary as a batch. This prevents tiny batches from going into some buckets every time you flush, and it reduces the width of the merging within each bucket, speeding up phase 2.
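
A hedged sketch of that modification: one dictionary per radix bucket, and when the total nears the memory ceiling, only the largest bucket is written out as a sorted batch and cleared. Bucketing on the lower-cased first character and picking the victim purely by size are the simple choices from above; the smarter age/dup-aware scoring discussed below would slot into PickVictim.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class RadixBuckets
    {
        readonly Dictionary<char, SortedDictionary<string, long>> buckets =
            new Dictionary<char, SortedDictionary<string, long>>();

        static char BucketOf(string key)
        {
            char c = char.ToLowerInvariant(key[0]);
            return (c >= 'a' && c <= 'z') ? c : '#';   // one misc bucket for non-alphabetic starts
        }

        public void Add(string key)
        {
            char b = BucketOf(key);
            if (!buckets.TryGetValue(b, out var dict))
                buckets[b] = dict = new SortedDictionary<string, long>(StringComparer.Ordinal);
            dict.TryGetValue(key, out long c);
            dict[key] = c + 1;
        }

        public int TotalUniqueKeys => buckets.Values.Sum(d => d.Count);

        // The caller writes the victim out as one sorted batch, then clears it.
        public KeyValuePair<char, SortedDictionary<string, long>> PickVictim() =>
            buckets.OrderByDescending(kv => kv.Value.Count).First();
    }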

With a binary tree, this reduces the depth of each tree by about log2(num_buckets), speeding up lookups. With a Trie, this is redundant (each node uses the next character as a radix to order the child trees). With a DAWG, this actually hurts your space-efficiency because you lose out on finding the redundancy between strings with different starts but later shared parts.

This has the potential to behave poorly if there are a few infrequently-touched buckets that keep growing, but don't usually end up being the largest. They could use up a big fraction of your total memory, making for small batches from the commonly-used buckets. You could implement a smarter eviction algorithm that records when a bucket (dictionary) was last emptied. The NeedsEmptying score for a bucket would be something like a product of size and age. Or maybe some function of age, like sqrt(age). Some way to record how many duplicates each bucket has found since last emptied would be useful, too. If you're in a spot in your input stream where there are a lot of repeats for one of the buckets, the last thing you want to do is empty it frequently. Maybe every time you find a duplicate in a bucket, increment a counter. Look at the ratio of age vs. dups-found. Low-use buckets sitting there taking RAM away from other buckets will be easy to find that way, when their size starts to creep up. Really-valuable buckets might be kept even when they're the current biggest, if they're finding a lot of duplicates.

If your data structures for tracking age and dups-found are a struct-of-arrays, the (last_emptied[bucket] - current_pos) / (float)dups_found[bucket] division can be done efficiently with vector floating point. One integer division is slower than one FP division, and four FP divisions done as a SIMD vector cost about the same as one, so compilers can hopefully auto-vectorize if you make it easy for them like this.

There's a lot of work to do between buckets filling up, so division would be a tiny hiccup unless you use a lot of buckets.

With a good eviction algorithm, an ideal choice of bucketing will put keys that rarely have duplicates together in some buckets, and buckets that have many duplicates together in other buckets. If you're aware of any patterns in your data, this would be a way to exploit it. Having some buckets that are mostly low-dup means that all those unique keys don't wash away the valuable keys into an output batch. An eviction algorithm that looks at how valuable a bucket has been in terms of dups found per unique key will automatically figure out which buckets are valuable and worth keeping, even though their size is creeping up.

There are many ways to radix your strings into buckets. Some will ensure that every element in a bucket compares less than every element in every later bucket, so producing fully-sorted output is easy. Some won't, but have other advantages. There are going to be tradeoffs between bucketing choices, all of which are data-dependent:

  • How good it is at finding lots of duplicates in the first pass (e.g. by separating high-duplicate patterns from low-duplicate patterns).
  • How evenly it spreads the number of batches across buckets (so no single bucket has a huge number of batches that need a multi-stage merge in phase 2), among other factors.
  • How badly it can behave when combined with your eviction algorithm on your data set.
  • How much merging between buckets is needed to produce globally sorted output. The importance of this scales with the total number of unique strings, not the number of input strings.

I'm sure clever people have thought about good ways to bucket strings before me, so this is probably worth searching on if the obvious approach of by-first-character isn't ideal. This special use-case (of sorting while eliminating/counting duplicates) is not typical. I think most work on sorting only considers sorts that preserve duplicates. So you might not find much that helps choose a good bucketing algorithm for a dup-counting external sort. In any case, it will be data-dependent.

Some concrete options for bucketing are: Radix = the first two bytes together (still combining upper/lowercase, and combining non-alphabetic characters). Or Radix = the first byte of the hash code. (Requires a global merge to produce sorted output.) Or Radix = ((str[0]>>2) << 6) + (str[1]>>2), i.e. ignore the low 2 bits of the first 2 chars, to put [abcd][abcd].* together, [abcd][efgh].* together, etc. This would also require some merging of the sorted results between some sets of buckets: e.g. daxxx would be in the first bucket, but aexxx would be in the 2nd. But only buckets with the same first-char high-bits need to be merged with each other to produce the sorted final output.
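
The three options above, sketched as functions. The exact bit layout and the fold-to-28-symbols mapping are illustrative assumptions; note that string.GetHashCode is randomized per process in modern .NET, which is fine within a single run but not across runs.

    static class RadixChoices
    {
        // Fold to a small alphabet: 0 = sorts before 'a', 1..26 = letters, 27 = sorts after 'z'.
        static int Fold(char c)
        {
            c = char.ToLowerInvariant(c);
            if (c >= 'a' && c <= 'z') return c - 'a' + 1;
            return c < 'a' ? 0 : 27;
        }

        // Radix = first two chars folded together: up to 28 * 28 = 784 buckets, order-preserving.
        public static int FirstTwoChars(string s) =>
            Fold(s[0]) * 28 + Fold(s.Length > 1 ? s[1] : '\0');

        // Radix = one byte of the hash code: even spread, but needs a global merge at the end.
        public static int HashByte(string s) => (s.GetHashCode() >> 24) & 0xFF;

        // Radix = ((str[0]>>2) << 6) + (str[1]>>2): groups [abcd][abcd].*, [abcd][efgh].*, etc.
        public static int HighBits(string s) =>
            ((s[0] >> 2) << 6) + ((s.Length > 1 ? s[1] : '\0') >> 2);
    }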

An idea for handling a bucketing choice that gives great dup-finding but needs merge-sorting between buckets: When writing the phase2 output, bucket it with the first character as the radix to produce the sort order you want. Each phase1 bucket scatters output into phase2 buckets as part of the global sort. Once all the phase1 batches that can include strings starting with a have been processed, do the merge of the a phase2-bucket into the final output and delete those temp files.

Radix = the first 2 bytes (combining non-alphabetic characters) would make for 28² = 784 buckets. With 200MiB of RAM, that's an average output file size of only ~256k. Emptying just one bucket at a time would make that the minimum, and you'd usually get larger batches, so this could work. (Your eviction algorithm could hit a pathological case that made it keep a lot of big buckets and write a series of tiny batches for new buckets. There are dangers to clever heuristics if you don't test carefully.)

Multiple batches packed into the same output file is probably most useful with many small buckets. You'll have e.g. 784 output files, each containing a series of batches. Hopefully your filesystem has enough contiguous free space, and is smart enough, to do a good job of not fragmenting too badly when scattering small-ish writes to many files.

In the merging stages, with sorted batches we don't need a dictionary. Just take the next line from the batch that has the lowest key, combining duplicates as you find them.

MergeSort typically merges pairs, but when doing external sorting (i.e. disk -> disk), a much wider input is common to avoid reading and re-writing the output a lot of times. Having 25 input files open to merge into one output file should be fine. Use the library implementation of PriorityQueue (typically implemented as a Heap) to choose the next input element from many sorted lists. Maybe add input lines with the string as the priority, and the count and input file number as payload.
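
A hedged sketch of that many-way merge. PriorityQueue<TElement, TPriority> is available from .NET 6 onward; here the element is the index of the reader whose current line is smallest, and equal keys from different files have their counts summed. It assumes the "key,count" batch format from the earlier sketches.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class KWayMerge
    {
        public static void Merge(IEnumerable<string> inputPaths, string outputPath)
        {
            var readers = inputPaths.Select(p => new StreamReader(p)).ToList();
            var pq = new PriorityQueue<int, string>(StringComparer.Ordinal); // element = reader index
            var current = new (string Key, long Count)[readers.Count];

            for (int i = 0; i < readers.Count; i++)
                if (TryReadEntry(readers[i], out current[i]))
                    pq.Enqueue(i, current[i].Key);

            using (var w = new StreamWriter(outputPath))
            {
                string pendingKey = null; long pendingCount = 0;
                while (pq.TryDequeue(out int i, out _))
                {
                    var (key, count) = current[i];
                    if (key == pendingKey) pendingCount += count;          // combine duplicates
                    else
                    {
                        if (pendingKey != null) w.WriteLine($"{pendingKey},{pendingCount}");
                        pendingKey = key; pendingCount = count;
                    }
                    if (TryReadEntry(readers[i], out current[i]))          // refill from the same file
                        pq.Enqueue(i, current[i].Key);
                }
                if (pendingKey != null) w.WriteLine($"{pendingKey},{pendingCount}");
            }
            foreach (var r in readers) r.Dispose();
        }

        static bool TryReadEntry(StreamReader r, out (string Key, long Count) entry)
        {
            string line = r.ReadLine();
            if (line == null) { entry = default; return false; }
            int comma = line.LastIndexOf(',');
            entry = (line.Substring(0, comma), long.Parse(line.Substring(comma + 1)));
            return true;
        }
    }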

If you used radix distribute-by-first-character in the first pass, then merge all the a batches into the final output file (even if this process takes multiple merging stages), then all the b batches, etc. You don't need to check any of the batches from the starts-with-a bucket against batches from any other bucket, so this saves a lot of merging work, esp. if your keys are well distributed by first character.

Throttle disk I/O during merging, to avoid bringing your server to its knees if disk prefetch generates a huge I/O queue depth of reads. Throttling your I/O, rather than using a narrower merge, is probably the better choice. If the server is busy with its normal job, it probably won't be doing many big sequential reads even if you're only reading a couple of files.

Check the system load occasionally while running. If it's high, sleep for 1 sec before doing some more work and checking again. If it's really high, don't do any more work until the load average drops (sleeping 30sec between checks).
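
A rough throttling sketch along those lines. Reading /proc/loadavg assumes a Linux server (on Windows you would use a performance counter instead), and the load thresholds derived from the core count are assumptions to tune.

    using System;
    using System.IO;
    using System.Threading;

    static class Throttle
    {
        static readonly double High = Environment.ProcessorCount * 0.5;     // assumed thresholds
        static readonly double VeryHigh = Environment.ProcessorCount * 0.9;

        // Call this between units of work (e.g. between batches or merge chunks).
        public static void PauseIfBusy()
        {
            double load = ReadLoadAverage();
            if (load < High) return;                               // quiet enough, keep going
            if (load < VeryHigh) { Thread.Sleep(1000); return; }   // high: brief pause, then more work
            while (ReadLoadAverage() >= VeryHigh)                  // really high: wait it out
                Thread.Sleep(30_000);
        }

        static double ReadLoadAverage()
        {
            // First field of /proc/loadavg is the 1-minute load average (Linux only).
            string first = File.ReadAllText("/proc/loadavg").Split(' ')[0];
            return double.Parse(first, System.Globalization.CultureInfo.InvariantCulture);
        }
    }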

Check the system memory usage, too, and reduce your batch threshold if memory is tight on the production server. (Or if really tight, flush your partial batch and sleep until memory pressure reduces.)

If temp-file size is an issue, you could do common-prefix compression like frcode from updatedb/locate to significantly reduce the file size for sorted lists of strings. Probably use case-sensitive sorting within a batch, but case-insensitive radixing. So each batch in the a bucket will have all the As, then all the as. Or even LZ4 compress / decompress them on the fly. Use hex for the counts, not decimal. It's shorter, and faster to encode/decode.
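
A hedged sketch of that frcode-style encoding for a sorted batch: each line stores how many leading characters it shares with the previous key, the remaining suffix, and the count in hex, using '/' as the separator since it cannot appear in a filename (which ties into the next paragraph). The exact line format is an assumption.

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class PrefixCodec
    {
        // e.g. after "sign-a.xml", the key "sign-b.xml" with count 42 is written as "5/b.xml/2A".
        public static void Write(IEnumerable<KeyValuePair<string, long>> sortedEntries, TextWriter w)
        {
            string prev = "";
            foreach (var kv in sortedEntries)
            {
                int shared = CommonPrefixLength(prev, kv.Key);
                w.WriteLine($"{shared}/{kv.Key.Substring(shared)}/{kv.Value:X}");
                prev = kv.Key;
            }
        }

        public static IEnumerable<KeyValuePair<string, long>> Read(TextReader r)
        {
            string prev = "", line;
            while ((line = r.ReadLine()) != null)
            {
                var parts = line.Split(new[] { '/' }, 3);        // the suffix can't contain '/'
                string key = prev.Substring(0, int.Parse(parts[0])) + parts[1];
                yield return new KeyValuePair<string, long>(key, Convert.ToInt64(parts[2], 16));
                prev = key;
            }
        }

        static int CommonPrefixLength(string a, string b)
        {
            int n = Math.Min(a.Length, b.Length), i = 0;
            while (i < n && a[i] == b[i]) i++;
            return i;
        }
    }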

Use a separator that's not a legal filename character, like /, between key and count. String parsing might well take up a lot of the CPU time in the merge stage, so it's worth considering. If you can leave strings in per-file input buffers, and just point your PQueue at them, that might be good. (And tell you which input file a string came from, without storing that separately.)

Performance tuning:

If the initial unsorted strings were available extremely fast, then a hash table with small batches that fit the dictionary in the CPU L3 cache might be a win, unless a larger window can include a much larger fraction of keys, and find more dups. It depends on how many repeats are typical in say 100k files. Build small sorted batches in RAM as you read, then merge them to a disk batch. This may be more efficient than doing a big in-memory quicksort, since you don't have random access to the input until you've initially read it.

Since I/O will probably be the limit, large batches that don't fit in the CPU's data cache are probably a win, to find more duplicates and (greatly?) reduce the amount of merging work to be done.

It might be convenient to check the hash table size / memory consumption after every chunk of filenames you get from the OS, or after every subdirectory or whatever. As long as you choose a conservative size bound, and you make sure you can't go for too long without checking, you don't need to go nuts checking every iteration.

This paper from 1983 examines external merge-sorting eliminating duplicates as they're encountered, and also suggests duplicate elimination with a hash function and a bitmap. With long input strings, storing MD5 or SHA1 hashes for duplicate-elimination saves a lot of space.

I'm not sure what they had in mind with their bitmap idea. Being collision-resistant enough to be usable without going back to check the original string would require a hash code of too many bits to index a reasonable-size bitmap. (e.g. MD5 is a 128bit hash).
