Memory-constrained external sorting of strings, with duplicates combined & counted, on a critical server (billions of filenames)


Problem description



Our server produces files like {c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml in its log folder. The first part is a GUID; the second part is a name template.

I want to count the number of files with the same name template. For instance, we have

{c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml
{aa3718d1-98e2-4559-bab0-1c69f04eb7ec}-hero.xml
{0c7a50dc-972e-4062-a60c-062a51c7b32c}-sign.xml

The result should be

sign.xml,2
hero.xml,1

The total number of possible name templates is unknown, and possibly exceeds int.MaxValue.

The total number of files on the server is unknown, and possibly exceeds int.MaxValue.

Requirements:

The final result should be sorted by name template.

The server on which the tool is going to run is super critical. We should be able to tell memory usage (MB) and the number of temporary files generated, if any, before running the tool and without knowing any characteristic of the log folder.

We use C# language.

My idea:

  • For the first 5000 files, count the occurrences, write result to Group1.txt.
  • For the second 5000 files, count the occurrences, write result to Group2.txt.
  • Repeat until all files are processed. Now we have a bunch of group files.

Then I merge all these group files.

   Group1.txt     Group2.txt   Group3.txt     Group4.txt   
       \            /            \                /
       Group1-2.txt                Group3-4.txt
                  \                 /
                    Group1-4.txt

Group1-4.txt is the final result.

The disagreement between me and my friend is how we count the occurrences.

I suggest using a dictionary, with the file name template as the key. Let m be the partition size (in this example, 5000). Then the time complexity is O(m) and the space complexity is O(m).

My friend suggests sorting the name templates, then counting the occurrences in one pass, since identical name templates are then adjacent. Time complexity O(m log m), space complexity O(m).

We cannot persuade each other. Do you guys see any issues with the two methods?

Solution

IDK if external sorting with count-merging of duplicates has been studied. I did find a 1983 paper (see below). Usually, sorting algorithms are designed and studied with the assumption of sorting objects by keys, so duplicate keys have different objects. There might be some existing literature on this, but it's a very interesting problem. Probably it's just considered an application of compact dictionaries combined with external merge-sorting.

Efficient dictionaries for storing large amounts of strings in little memory is a very well studied problem. Most of the useful data structures can include auxiliary data for each word (in our case, a dup count).


TL:DR summary of useful ideas, since I rambled on in way too much detail about many things in the main body of this answer:

  • Batch boundaries when your dictionary size hits a threshold, not after fixed numbers of input files. If there were a lot of duplicates in a group of 5000 strings, you still won't be using very much memory. You can find way more duplicates in the first pass this way.

  • Sorted batches make merging much faster. You can and should merge many->one instead of binary merging. Use a PriorityQueue to figure out which input file has the line you should take next.

  • To avoid a burst of memory usage when sorting the keys in a hash table, use a dictionary that can do an in-order traversal of keys. (i.e. sort on the fly.) There's SortedDictionary<TKey, TValue> (binary tree-based). This also interleaves the CPU usage of sorting with the I/O waiting to get the input strings.

  • Radix-sort each batch into outputs by first-character (a-z, non-alphabetic that sorts before A, and non-alphabetic that sorts after z). Or some other bucketing choice that distributes your keys well. Use separate dictionaries for each radix bucket, and empty only the biggest into a batch when you hit your memory ceiling. (fancier eviction heuristics than "biggest" may be worth it.)

  • Throttle I/O (esp. when merging), and check system CPU load and memory pressure. Adapt behaviour accordingly to make sure you don't cause an impact when the server is most busy.

  • For smaller temp files at the cost of CPU time, use a common-prefix encoding, or maybe lz4.

  • A space-efficient dictionary will allow larger batch sizes (and thus a larger duplicate-finding window) for the same upper memory bound. A Trie (or better, Radix Trie) might be ideal, because it stores the characters within the tree nodes, with common prefixes only stored once. Directed Acyclic Word Graphs are even more compact (finding redundancy between common substrings that aren't prefixes). Using one as a Dictionary is tricky but probably possible (see below).

  • Take advantage of the fact that you don't need to delete any tree nodes or strings until you're going to empty the whole dictionary. Use a growable array of nodes, and another growable char array that packs strings head to tail. (Useful for a Radix Trie (multi-char nodes), but not a regular Trie where each node is a single char.)

  • Depending on how the duplicates are distributed, you might or might not be able to find very many on the first pass. This has some implications, but doesn't really change how you end up merging.


I'm assuming you have some directory traversal idea in mind, which can efficiently supply your code with a stream of strings to be uniquified and counted. So I'll just say "strings" or "keys", to talk about the inputs.

Trim off as many unnecessary characters as possible (e.g. lose the .xml if they're all .xml).


It might be useful to do the CPU/memory intensive work on a separate machine, depending on what other hardware you have with a fast network connection to your critical production server.

You could run a simple program on the server that sends filenames over a TCP connection to a program running on another machine, where it's safe to use much more memory. The program on the server could still do small dictionary batches, and just store them on a remote filesystem.


And now, since none of the other answers really put all the pieces together, here's my actual answer:

An upper bound on memory usage is easy. Write your program to use a constant memory ceiling, regardless of input size. Bigger inputs will lead to more merging phases, not more memory usage at any point.

The best estimate of temporary file storage space you can do without looking at the input is a very conservative upper bound that assumes every input string is unique. You need some way to estimate how many input strings there will be. (Most filesystems know how many separate files they contain, without having to walk the directory tree and count them.)

You can make some assumptions about the distribution of duplicates to make a better guess.

If number, rather than size, of scratch files is an issue, you can store multiple batches in the same output file, one after another. Either put length-headers at the start of each to allow skipping forward by batch, or write byte offsets to a separate data stream. If size is also important, see my paragraph about using frcode-style common-prefix compression.
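To make the length-header idea concrete, here is a minimal C# sketch (the key/hex-count line format, the '/' separator, and the AppendBatch name are my assumptions, not something from this answer): each batch is serialized, then written with a 4-byte length prefix so a reader can skip forward batch by batch without parsing the lines.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class BatchFile
{
    // Append one sorted batch of (key, count) pairs to a shared scratch file,
    // prefixed with its byte length so a reader can skip whole batches.
    public static void AppendBatch(string path, IEnumerable<KeyValuePair<string, long>> batch)
    {
        var sb = new StringBuilder();
        foreach (var kv in batch)
            sb.Append(kv.Key).Append('/').Append(kv.Value.ToString("x")).Append('\n');
        byte[] payload = Encoding.UTF8.GetBytes(sb.ToString());

        using var fs = new FileStream(path, FileMode.Append, FileAccess.Write);
        using var bw = new BinaryWriter(fs);
        bw.Write(payload.Length);   // 4-byte length header, allows skipping by batch
        bw.Write(payload);          // the batch itself
    }
}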


As Ian Mercer points out in his answer, sorting your batches will make merging them much more efficient. If you don't, you either risk hitting a wall where your algorithm can't make forward progress, or you need to do something like load one batch, scan another batch for entries that are in the first, and rewrite the 2nd batch with just the potentially-few matching entries removed.

Not sorting your batches makes the time complexity of the first pass O(N), but either you have to sort at some point later, or your later stages have a worst-case bound that's dramatically worse. You want your output globally sorted, so other than RadixSort approaches, there's no avoiding an O(N log N) somewhere.

With limited batch size, O(log N) merge steps are expected, so your original analysis missed the O(N log N) complexity of your approach by ignoring what needs to happen after the phase1 batches are written.


The appropriate design choices change a lot depending on whether our memory ceiling is big enough to find many duplicates within one batch. If even a complex compact data structure like a Trie doesn't help much, putting the data into a Trie and getting it out again to write a batch is a waste of CPU time.

If you can't do much duplicate-elimination within each batch anyway, then you need to optimize for putting possibly-matching keys together for the next stage. Your first stage could group input strings by first byte, into up-to 252 or so output files (not all 256 values are legal filename characters), or into 27 or so output files (alphabet + misc), or 26+26+1 for upper/lower case + non-alphabetic. Temp files can omit the common prefix from each string.

Then most of these first stage batches should have a much higher duplicate density. Actually, this Radix distribution of inputs into output buckets is useful in any case, see below.

You should still sort your first-stage outputs in chunks, to give the next pass a much wider dup-find window for the same RAM.


I'm going to spend more time on the domain where you can find a useful amount of duplicates in the initial stream, before using up ~100MiB of RAM, or whatever you choose as an upper limit.

Obviously we add strings to some sort of dictionary to find and count duplicates on the fly, while only requiring enough storage for the set of unique strings. Just storing strings and then sorting them would be significantly less efficient, because we'd hit our RAM limit much sooner without on-the-fly duplicate detection.

To minimize the phase2 work, phase1 should find and count as many duplicates as possible, reducing the total size of the p2 data. Reducing the amount of merging work for phase2 is good, too. Bigger batches helps with both factors, so it's very useful to come as close to your memory ceiling as you safely can in phase1. Instead of writing a batch after a constant number of input strings, do it when your memory consumption nears your chosen ceiling. Duplicates are counted and thrown away, and don't take any extra storage.

An alternative to accurate memory accounting is tracking the unique strings in your dictionary, which is easy (and done for you by the library implementation). Accumulating the length of strings added can give you a good estimate of memory used for storing the strings, too. Or just make an assumption about string length distribution. Make your hash table the right size initially so it doesn't have to grow while you add elements, so you stop when it's 60% full (load factor) or something.


A space-efficient data structure for the dictionary increases our dup-finding window for a given memory limit. Hash tables get badly inefficient when their load factor is too high, but the hash table itself only has to store pointers to the strings. It's the most familiar dictionary and has library implementations.

We know we're going to want our batch sorted once we've seen enough unique keys, so it might make sense to use a dictionary that can be traversed in sorted order. Sorting on the fly makes sense because keys will come in slowly, limited by disk IO since we're reading from filesystem metadata. One downside is if most of the keys we see are duplicates, then we're doing a lot of O(log batchsize) lookups, rather than a lot of O(1) lookups. And it's more likely that a key will be a duplicate when the dictionary is big, so most of those O(log batchsize) queries will be with a batch size near max, not uniformly distributed between 0 and max. A tree pays the O(log n) overhead of sorting for every lookup, whether the key turned out to be unique or not. A hash table only pays the sorting cost at the end after removing duplicates. So for a tree it's O(total_keys * log unique_keys), hash table is O(unique_keys * log unique_keys) to sort a batch.

A hash table with max load factor set to 0.75 or something might be pretty dense, but having to sort the KeyValuePairs before writing out a batch probably puts a damper on using standard Dictionary. You don't need copies of the strings, but you'll probably end up copying all the pointers (refs) to scratch space for a non-in-place sort, and maybe also when getting them out of the hash table before sorting. (Or instead of just pointers, KeyValuePair, to avoid having to go back and look up each string in the hash table). If short spikes of big memory consumption are tolerable, and don't cause you to swap / page to disk, you could be fine. This is avoidable if you can do an in-place sort in the buffer used by the hash table, but I doubt that can happen with standard-library containers.

A constant trickle of CPU usage to maintain the sorted dictionary at the speed keys are available is probably better than infrequent bursts of CPU usage to sort all of a batch's keys, besides the burst of memory consumption.

The .NET standard library has SortedDictionary<TKey, TValue>, which the docs say is implemented with a binary tree. I didn't check if it has a rebalance function, or uses a red-black tree, to guarantee O(log n) worst case performance. I'm not sure how much memory overhead it would have. If this is a one-off task, then I'd absolutely recommend using this to implement it quickly and easily. And also for a first version of a more optimized design for repeated use. You'll probably find it's good enough, unless you can find a nice library implementation of Tries.
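As a concrete starting point, here is a minimal phase-1 sketch built on SortedDictionary (the '/' separator, hex counts, the batchNNNNN.txt naming, and the ~64-bytes-per-entry overhead guess are all my assumptions): it counts duplicates on the fly and flushes a sorted batch whenever a rough memory estimate nears the chosen ceiling, rather than after a fixed number of files.

using System;
using System.Collections.Generic;
using System.IO;

class Phase1Counter
{
    readonly SortedDictionary<string, long> counts = new(StringComparer.Ordinal);
    long estimatedBytes;              // rough accounting of dictionary size
    readonly long memoryCeiling;      // chosen up front, e.g. 100 MiB
    int batchNumber;

    public Phase1Counter(long memoryCeilingBytes) => memoryCeiling = memoryCeilingBytes;

    public void Add(string nameTemplate)
    {
        if (counts.TryGetValue(nameTemplate, out long c))
        {
            counts[nameTemplate] = c + 1;      // duplicate: counted, no extra storage
        }
        else
        {
            counts.Add(nameTemplate, 1);
            // 2 bytes per char plus a guessed ~64 bytes of node/object overhead per entry.
            estimatedBytes += 2L * nameTemplate.Length + 64;
            if (estimatedBytes >= memoryCeiling)
                FlushBatch();
        }
    }

    void FlushBatch()
    {
        using var w = File.CreateText($"batch{batchNumber++:D5}.txt");
        foreach (var kv in counts)             // SortedDictionary iterates in key order
            w.WriteLine($"{kv.Key}/{kv.Value:x}");
        counts.Clear();
        estimatedBytes = 0;
    }

    public void Finish() { if (counts.Count > 0) FlushBatch(); }
}

Swapping in a Trie or Radix Trie later would only change the dictionary type and the memory estimate; the flush-at-ceiling logic stays the same.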


Data structures for memory-efficient sorted dictionaries

The more memory-efficient our dictionary is, the more dups we can find before having to write out a batch and delete the dictionary. Also, if it's a sorted dictionary, our batches can be larger even when they aren't finding many duplicates.

A secondary impact of data structure choice is how much memory traffic we generate while running on the critical server. A sorted array (with O(log n) lookup time (binary search), and O(n) insert time (shuffle elements to make room)) would be compact. However, it wouldn't just be slow, it would saturate memory bandwidth with memmove a lot of the time. 100% CPU usage doing this would have a bigger impact on the server's performance than 100% CPU usage searching a binary tree. It doesn't know where to load the next node from until it's loaded the current node, so it can't pipeline memory requests. The branch mispredicts of comparisons in the tree search also help moderate consumption of the memory bandwidth that's shared by all cores. (That's right, some 100%-CPU-usage programs are worse than others!)

It's nice if emptying our dictionary doesn't leave memory fragmented when we empty it. Tree nodes will be constant size, though, so a bunch of scattered holes will be usable for future tree node allocations. However, if we have separate dictionaries for multiple radix buckets (see below), key strings associated with other dictionaries might be mixed in with tree nodes. This could lead to malloc having a hard time reusing all the freed memory, potentially increasing actual OS-visible memory usage by some small factor. (Unless C# runtime garbage collection does compaction, in which case fragmentation is taken care of.)

Since you never need to delete nodes until you want to empty the dictionary and delete them all, you could store your Tree nodes in a growable array. So memory management only has to keep track of one big allocation, reducing bookkeeping overhead compared to malloc of each node separately. Instead of real pointers, the left / right child pointers could be array indices. This lets you use only 16 or 24 bits for them. (A Heap is another kind of binary tree stored in an array, but it can't be used efficiently as a dictionary. It's a tree, but not a search tree).

Storing the string keys for a dictionary would normally be done with each String as a separately-allocated object, with pointers to them in an array. Since again, you never need to delete, grow, or even modify one until you're ready to delete them all, you can pack them head to tail in a char array, with a terminating zero-byte at the end of each one. This again saves a lot of book-keeping, and also makes it easy to keep track of how much memory is in use for the key strings, letting you safely come closer to your chosen memory upper bound.
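A sketch of that packed layout (field names, the 24-byte node estimate, and the API are my own, not from the answer): one growable array of fixed-size nodes whose child links are array indices, plus one growable char pool holding every key head to tail.

using System.Collections.Generic;

// Storage for a binary search tree with no per-node or per-string allocations:
// nodes live in one growable list, key characters in one shared char pool,
// and nothing is freed until the whole dictionary is discarded.
struct TreeNode
{
    public int KeyStart;   // offset into the shared char pool
    public int KeyLength;
    public long Count;     // duplicate count for this key
    public int Left;       // index into the node list, -1 = no child
    public int Right;
}

class PackedTreeStorage
{
    public readonly List<TreeNode> Nodes = new();
    readonly List<char> charPool = new();

    // Cheap memory accounting: 2 bytes per pooled char plus ~24 bytes per node.
    public long ApproxBytes => 2L * charPool.Count + 24L * Nodes.Count;

    public int NewNode(string key)
    {
        int start = charPool.Count;
        charPool.AddRange(key);                // strings packed head to tail
        Nodes.Add(new TreeNode { KeyStart = start, KeyLength = key.Length,
                                 Count = 1, Left = -1, Right = -1 });
        return Nodes.Count - 1;                // the index doubles as the "pointer"
    }

    public string KeyOf(int node) =>
        new string(charPool.GetRange(Nodes[node].KeyStart, Nodes[node].KeyLength).ToArray());
}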

Trie / DAWG for even more compact storage

For even denser storage of a set of strings, we can eliminate the redundancy of storing all the characters of every string, since there are probably a lot of common prefixes.

A Trie stores the strings in the tree structure, giving you common-prefix compression. It can be traversed in sorted order, so it sorts on the fly. Each node has as many children as there are different next-characters in the set, so it's not a binary tree. A C# Trie partial implementation (delete not written) can be found in this SO answer, to a question similar to this but not requiring batching / external sorting.

Trie nodes need to store potentially many child pointers, so each node can be large. Or each node could be variable-size, holding the list of nextchar:ref pairs inside the node, if C# makes that possible. Or as the Wikipedia article says, a node can actually be a linked-list or binary search tree, to avoid wasting space in nodes with few children. (The lower levels of a tree will have a lot of that.) End-of-word markers / nodes are needed to distinguish between substrings that aren't separate dictionary entries, and ones that are. Our count field can serve that purpose. Count=0 means the substring ending here isn't in the dictionary; count>=1 means it is.
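As a minimal illustration of the count field doubling as the end-of-word marker, here is a simple counting Trie (the per-node SortedDictionary of children is chosen for clarity; it is far less compact than the variable-size-node or BST-per-node layouts discussed above):

using System.Collections.Generic;

// Count == 0 means "prefix only, not an entry"; Count >= 1 means this node
// ends a real key, with its duplicate count.
class TrieNode
{
    public long Count;
    public readonly SortedDictionary<char, TrieNode> Children = new();
}

class CountingTrie
{
    readonly TrieNode root = new();

    public void Add(string key)
    {
        var node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children.Add(c, child = new TrieNode());
            node = child;
        }
        node.Count++;                          // duplicate keys just bump the count
    }

    // In-order traversal yields keys in sorted order, ready to write as a batch.
    public IEnumerable<(string Key, long Count)> Enumerate() => Enumerate(root, "");

    IEnumerable<(string Key, long Count)> Enumerate(TrieNode node, string prefix)
    {
        if (node.Count > 0) yield return (prefix, node.Count);
        foreach (var kv in node.Children)
            foreach (var item in Enumerate(kv.Value, prefix + kv.Key))
                yield return item;
    }
}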

A more compact Trie is the Radix Tree, or PATRICIA Tree, which stores multiple characters per node.

Another extension of this idea is the Deterministic acyclic finite state automaton (DAFSA), sometimes called a Directed Acyclic Word Graph (DAWG), but note that the DAWG wikipedia article is about a different thing with the same name. I'm not sure a DAWG can be traversed in sorted order to get all the keys out at the end, and as wikipedia points out, storing associated data (like a duplicate count) requires a modification. I'm also not sure they can be built incrementally, but I think you can do lookups without having compacted. The newly added entries will be stored like a Trie, until a compaction step every 128 new keys merges them into the DAWG. (Or run the compaction less frequently for bigger DAWGs, so you aren't doing it too much, like doubling the size of a hash table when it has to grow, instead of growing linearly, to amortize the expensive op.)

You can make a DAWG more compact by storing multiple characters in a single node when there isn't any branching / converging. This page also mentions a Huffman-coding approach to compact DAWGs, and has some other links and article citations.

JohnPaul Adamovsky's DAWG implementation (in C) looks good, and describes some optimizations it uses. I haven't looked carefully to see if it can map strings to counts. It's optimized to store all the nodes in an array.

This answer to the dup-count words in 1TB of text question suggests DAWGs, and has a couple links, but I'm not sure how useful it is.


Writing batches: Radix on first character

You could get your RadixSort on, and keep separate dictionaries for every starting character (or for a-z, non-alphabetic that sorts before a, non-alphabetic that sorts after z). Each dictionary writes out to a different temp file. If you have multiple compute nodes available for a MapReduce approach, this would be the way to distribute merging work to the compute nodes.

This allows an interesting modification: instead of writing all radix buckets at once, only write the largest dictionary as a batch. This prevents tiny batches going into some buckets each time you hit the memory ceiling. This will reduce the width of the merging within each bucket, speeding up phase2.

With a binary tree, this reduces the depth of each tree by about log2(num_buckets), speeding up lookups. With a Trie, this is redundant (each node uses the next character as a radix to order the child trees). With a DAWG, this actually hurts your space-efficiency because you lose out on finding the redundancy between strings with different starts but later shared parts.

This has the potential to behave poorly if there are a few infrequently-touched buckets that keep growing, but don't usually end up being the largest. They could use up a big fraction of your total memory, making for small batches from the commonly-used buckets. You could implement a smarter eviction algorithm that records when a bucket (dictionary) was last emptied. The NeedsEmptying score for a bucket would be something like a product of size and age. Or maybe some function of age, like sqrt(age). Some way to record how many duplicates each bucket has found since last emptied would be useful, too. If you're in a spot in your input stream where there are a lot of repeats for one of the buckets, the last thing you want to do is empty it frequently. Maybe every time you find a duplicate in a bucket, increment a counter. Look at the ratio of age vs. dups-found. Low-use buckets sitting there taking RAM away from other buckets will be easy to find that way, when their size starts to creep up. Really-valuable buckets might be kept even when they're the current biggest, if they're finding a lot of duplicates.

If your data structure for tracking age and dups found is a struct-of-arrays, the (last_emptied[bucket] - current_pos) / (float)dups_found[bucket] division can be done efficiently with vector floating point. One integer division is slower than one FP division. One FP division is the same speed as 4 FP divisions, and compilers can hopefully auto-vectorize if you make it easy for them like this.

There's a lot of work to do between buckets filling up, so division would be a tiny hiccup unless you use a lot of buckets.
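A scalar sketch of one possible NeedsEmptying score (the exact formula, size scaled up by sqrt(age) and down by dups-per-input-seen, is just one guess at a heuristic; every name here is made up):

using System;
using System.Collections.Generic;

class BucketStats
{
    public long Size;            // unique keys (or bytes) currently held
    public long LastEmptiedPos;  // input position when this bucket was last flushed
    public long DupsFound;       // duplicates counted since the last flush
}

static class Eviction
{
    // Pick the bucket to flush when total memory crosses the ceiling.
    // Big, old buckets that find few duplicates score highest.
    public static int ChooseVictim(IReadOnlyList<BucketStats> buckets, long currentPos)
    {
        int victim = 0;
        double best = double.MinValue;
        for (int i = 0; i < buckets.Count; i++)
        {
            var b = buckets[i];
            double age = currentPos - b.LastEmptiedPos + 1;
            double dupRate = (b.DupsFound + 1) / age;            // dups per input seen
            double score = b.Size * Math.Sqrt(age) / dupRate;    // hypothetical heuristic
            if (score > best) { best = score; victim = i; }
        }
        return victim;
    }
}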

Choosing how to bucket

With a good eviction algorithm, an ideal choice of bucketing will put keys that rarely have duplicates together in some buckets, and keys that often have duplicates together in other buckets. If you're aware of any patterns in your data, this would be a way to exploit it. Having some buckets that are mostly low-dup means that all those unique keys don't wash away the valuable keys into an output batch. An eviction algorithm that looks at how valuable a bucket has been in terms of dups found per unique key will automatically figure out which buckets are valuable and worth keeping, even though their size is creeping up.

There are many ways to radix your strings into buckets. Some will ensure that every element in a bucket compares less than every element in every later bucket, so producing fully-sorted output is easy. Some won't, but have other advantages. There are going to be tradeoffs between bucketing choices, all of which are data-dependent:

  • good at finding a lot of duplicates in the first pass (e.g. by separating high-dup patterns from low-dup patterns)
  • distributes the number of batches uniformly between buckets (so no bucket has a huge number of batches requiring a multi-stage merge in phase2), and maybe other factors.
  • produces bad behaviour when combined with your eviction algorithm on your data set.
  • amount of between-bucket merging needed to produce globally-sorted output. The importance of this scales with the total number of unique strings, not the number of input strings.

I'm sure clever people have thought about good ways to bucket strings before me, so this is probably worth searching on if the obvious approach of by-first-character isn't ideal. This special use-case (of sorting while eliminating/counting duplicates) is not typical. I think most work on sorting only considers sorts that preserve duplicates. So you might not find much that helps choose a good bucketing algorithm for a dup-counting external sort. In any case, it will be data-dependent.

Some concrete options for bucketing are: Radix = first two bytes together (still combining upper/lowercase, and combining non-alphabetic characters). Or Radix = the first byte of the hash code. (Requires a global-merge to produce sorted output.) Or Radix = ((str[0]>>2) << 6) + (str[1]>>2). i.e. ignore the low 2 bits of the first 2 chars, to put [abcd][abcd].* together, [abcd][efgh].* together, etc. This would also require some merging of the sorted results between some sets of buckets. e.g. daxxx would be in the first bucket, but aexxx would be in the 2nd. But only buckets with the same first-char high-bits need to be merged with each other to produce the sorted final output.
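A small helper makes that last grouping explicit (the case-folding and the masking to 6 bits are my additions for safety with non-ASCII characters; the bucket numbering itself follows the formula above):

using System;

static class RadixBucket
{
    // Bucket by the first two characters, folding case and ignoring the low
    // 2 bits of each char: [abcd][abcd].* land together, [abcd][efgh].* together, etc.
    public static int Index(string key)
    {
        char c0 = key.Length > 0 ? char.ToLowerInvariant(key[0]) : '\0';
        char c1 = key.Length > 1 ? char.ToLowerInvariant(key[1]) : '\0';
        return (((c0 >> 2) & 0x3F) << 6) | ((c1 >> 2) & 0x3F);   // 0 .. 4095
    }
}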

An idea for handling a bucketing choice that gives great dup-finding but needs merge-sorting between buckets: When writing the phase2 output, bucket it with the first character as the radix to produce the sort order you want. Each phase1 bucket scatters output into phase2 buckets as part of the global sort. Once all the phase1 batches that can include strings starting with a have been processed, do the merge of the a phase2-bucket into the final output and delete those temp files.

Radix = first 2 bytes (combining non-alphabetic) would make for 28^2 = 784 buckets. With 200MiB of RAM, that's an average output file size of only ~256k. Emptying just one bucket at a time would make that the minimum, and you'd usually get larger batches, so this could work. (Your eviction algorithm could hit a pathological case that made it keep a lot of big buckets, and write a series of tiny batches for new buckets. There are dangers to clever heuristics if you don't test carefully).

Multiple batches packed into the same output file is probably most useful with many small buckets. You'll have e.g. 784 output files, each containing a series of batches. Hopefully your filesystem has enough contiguous free space, and is smart enough, to do a good job of not fragmenting too badly when scattering small-ish writes to many files.


Merging:

In the merging stages, with sorted batches we don't need a dictionary. Just take the next line from whichever batch has the lowest key, combining duplicates as you find them.

MergeSort typically merges pairs, but when doing external sorting (i.e. disk -> disk), a much wider input is common to avoid reading and re-writing the output a lot of times. Having 25 input files open to merge into one output file should be fine. Use the library implementation of PriorityQueue (typically implemented as a Heap) to choose the next input element from many sorted lists. Maybe add input lines with the string as the priority, and the count and input file number as payload.
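Here is a sketch of that k-way merge on top of the .NET 6+ PriorityQueue<TElement, TPriority> (the key/hex-count line format and the file handling are assumptions carried over from the earlier sketches, not a definitive implementation):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class KWayMerge
{
    // Merge many sorted "key/hexcount" batch files into one, summing counts of equal keys.
    public static void Merge(IReadOnlyList<string> inputPaths, string outputPath)
    {
        var readers = new List<StreamReader>();
        foreach (var p in inputPaths) readers.Add(File.OpenText(p));

        // Element = reader index, priority = that reader's current key (ordinal order).
        var pq = new PriorityQueue<int, string>(StringComparer.Ordinal);
        var current = new (string Key, long Count)[readers.Count];

        for (int i = 0; i < readers.Count; i++)
            if (TryReadEntry(readers[i], out current[i])) pq.Enqueue(i, current[i].Key);

        using var output = File.CreateText(outputPath);
        string pendingKey = null;
        long pendingCount = 0;

        while (pq.TryDequeue(out int idx, out _))
        {
            var (key, count) = current[idx];
            if (key == pendingKey) pendingCount += count;          // combine duplicates
            else
            {
                if (pendingKey != null) output.WriteLine($"{pendingKey}/{pendingCount:x}");
                pendingKey = key; pendingCount = count;
            }
            if (TryReadEntry(readers[idx], out current[idx])) pq.Enqueue(idx, current[idx].Key);
        }
        if (pendingKey != null) output.WriteLine($"{pendingKey}/{pendingCount:x}");
        foreach (var r in readers) r.Dispose();
    }

    static bool TryReadEntry(StreamReader r, out (string Key, long Count) entry)
    {
        string line = r.ReadLine();
        if (line == null) { entry = default; return false; }
        int sep = line.LastIndexOf('/');
        entry = (line[..sep], long.Parse(line[(sep + 1)..], NumberStyles.HexNumber));
        return true;
    }
}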

If you used radix distribute-by-first-character in the first pass, then merge all the a batches into the final output file (even if this process takes multiple merging stages), then all the b batches, etc. You don't need to check any of the batches from the starts-with-a bucket against batches from any other bucket, so this saves a lot of merging work, esp. if your keys are well distributed by first character.


Minimizing impact on the production server:

Throttle disk I/O during merging, to avoid bringing your server to its knees if disk prefetch generates a huge I/O queue depth of reads. Throttling your I/O, rather than a narrower merge, is probably a better choice. If the server is busy with its normal job, it prob. won't be doing many big sequential reads even if you're only reading a couple files.

Check the system load occasionally while running. If it's high, sleep for 1 sec before doing some more work and checking again. If it's really high, don't do any more work until the load average drops (sleeping 30sec between checks).

Check the system memory usage, too, and reduce your batch threshold if memory is tight on the production server. (Or if really tight, flush your partial batch and sleep until memory pressure reduces.)
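A rough sketch of those checks (Windows-only: PerformanceCounter lives in the System.Diagnostics.PerformanceCounter package on modern .NET, and the 90%/70% CPU thresholds and 200 MiB working-set limit are arbitrary placeholders to tune for the actual box):

using System;
using System.Diagnostics;
using System.Threading;

static class Throttle
{
    static readonly PerformanceCounter cpu =
        new("Processor", "% Processor Time", "_Total");

    public static void PauseIfServerBusy()
    {
        // Note: the first NextValue() sample is always 0, so call this once at startup to prime it.
        float load = cpu.NextValue();
        while (load > 90f)                     // really high: stop working entirely
        {
            Thread.Sleep(TimeSpan.FromSeconds(30));
            load = cpu.NextValue();
        }
        if (load > 70f)                        // high: back off a little
            Thread.Sleep(TimeSpan.FromSeconds(1));

        // Our own footprint (bytes of physical RAM mapped to this process).
        if (Environment.WorkingSet > 200L * 1024 * 1024)
            Thread.Sleep(TimeSpan.FromSeconds(1));
    }
}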

If temp-file size is an issue, you could do common-prefix compression like frcode from updatedb/locate to significantly reduce the file size for sorted lists of strings. Probably use case-sensitive sorting within a batch, but case-insensitive radixing. So each batch in the a bucket will have all the As, then all the as. Or even LZ4 compress / decompress them on the fly. Use hex for the counts, not decimal. It's shorter, and faster to encode/decode.
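A minimal sketch of that frcode-style encoding for one sorted batch (the hex shared-prefix-length / suffix / hex count line layout is my assumption, chosen because '/' cannot appear in a filename):

using System;
using System.Collections.Generic;
using System.IO;

static class PrefixCode
{
    // Each line stores how many leading characters the key shares with the
    // previous key, then only the differing suffix, then the count in hex.
    public static void Write(IEnumerable<KeyValuePair<string, long>> sortedBatch, TextWriter w)
    {
        string prev = "";
        foreach (var kv in sortedBatch)
        {
            int shared = 0;
            int max = Math.Min(prev.Length, kv.Key.Length);
            while (shared < max && prev[shared] == kv.Key[shared]) shared++;
            w.WriteLine($"{shared:x}/{kv.Key.Substring(shared)}/{kv.Value:x}");
            prev = kv.Key;
        }
    }
}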

Use a separator that's not a legal filename character, like /, between key and count. String parsing might well take up a lot of the CPU time in the merge stage, so it's worth considering. If you can leave strings in per-file input buffers, and just point your PQueue at them, that might be good. (And tell you which input file a string came from, without storing that separately.)


Performance tuning:

If the initial unsorted strings were available extremely fast, then a hash table with small batches that fit the dictionary in the CPU L3 cache might be a win, unless a larger window can include a much larger fraction of keys, and find more dups. It depends on how many repeats are typical in say 100k files. Build small sorted batches in RAM as you read, then merge them to a disk batch. This may be more efficient than doing a big in-memory quicksort, since you don't have random access to the input until you've initially read it.

Since I/O will probably be the limit, large batches that don't fit in the CPU's data cache are probably a win, to find more duplicates and (greatly?) reduce the amount of merging work to be done.

It might be convenient to check the hash table size / memory consumption after every chunk of filenames you get from the OS, or after every subdirectory or whatever. As long as you choose a conservative size bound, and you make sure you can't go for too long without checking, you don't need to go nuts checking every iteration.


This paper from 1983 examines external merge-sorting eliminating duplicates as they're encountered, and also suggests duplicate elimination with a hash function and a bitmap. With long input strings, storing MD5 or SHA1 hashes for duplicate-elimination saves a lot of space.

I'm not sure what they had in mind with their bitmap idea. Being collision-resistant enough to be usable without going back to check the original string would require a hash code of too many bits to index a reasonable-size bitmap. (e.g. MD5 is a 128bit hash).
