Memory-constrained external sorting of strings, with duplicates combined & counted, on a critical server (billions of filenames)


Problem description


Our server produces files like {c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml in its log folder. The first part is a GUID; the second part is the name template.

I want to count the number of files with the same name template. For instance, we have

{c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml
{aa3718d1-98e2-4559-bab0-1c69f04eb7ec}-hero.xml
{0c7a50dc-972e-4062-a60c-062a51c7b32c}-sign.xml

The result should be

sign.xml,2
hero.xml,1

The total number of possible name templates is unknown and may exceed int.MaxValue.

The total number of files on the server is unknown and may exceed int.MaxValue.

Requirements:

The final result should be sorted by name template.

The server on which the tool is going to run is super critical. We should be able to tell memory usage (MB) and the number of temporary files generated, if any, before running the tool and without knowing any characteristic of the log folder.

We use the C# language.

My idea:

  • For the first 5000 files, count the occurrences, write result to Group1.txt.
  • For the second 5000 files, count the occurrences, write result to Group2.txt.
  • Repeat until all files are processed. Now we have a bunch of group files.

Then I merge all these group files.

   Group1.txt     Group2.txt   Group3.txt     Group4.txt   
       \            /            \                /
       Group1-2.txt                Group3-4.txt
                  \                 /
                    Group1-4.txt

Group1-4.txt is the final result.

The disagreement between me and my friend is how we count the occurrences.

I suggest using a dictionary, with the file name template as the key. Let m be the partition size (5000 in this example). Then the time complexity is O(m) and the space complexity is O(m).

My friend suggests sorting the name templates, then counting the occurrences in one pass, since identical name templates end up adjacent. Time complexity O(m log m), space complexity O(m).
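For concreteness, here is a minimal sketch of both counting approaches for a single batch (ExtractTemplate is a hypothetical helper that assumes the "{guid}-template" shape above; it is not part of our existing code):

using System;
using System.Collections.Generic;
using System.Linq;

static class BatchCounting
{
    // Hypothetical helper: "{c521c143-...}-sign.xml" -> "sign.xml"
    static string ExtractTemplate(string fileName) =>
        fileName.Substring(fileName.IndexOf("}-", StringComparison.Ordinal) + 2);

    // My approach: hash dictionary. O(m) time, O(m) space.
    static Dictionary<string, long> CountWithDictionary(IEnumerable<string> fileNames)
    {
        var counts = new Dictionary<string, long>();
        foreach (var name in fileNames)
        {
            var key = ExtractTemplate(name);
            counts.TryGetValue(key, out long c);
            counts[key] = c + 1;
        }
        return counts;
    }

    // My friend's approach: sort, then count runs of equal keys. O(m log m) time, O(m) space.
    static IEnumerable<(string Template, long Count)> CountWithSort(IEnumerable<string> fileNames)
    {
        var keys = fileNames.Select(ExtractTemplate).ToList();
        keys.Sort(StringComparer.Ordinal);
        for (int i = 0; i < keys.Count; )
        {
            int j = i;
            while (j < keys.Count && keys[j] == keys[i]) j++;   // run of identical templates
            yield return (keys[i], j - i);
            i = j;
        }
    }
}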

We cannot persuade each other. Do you guys see any issues with the two methods?

Solution

IDK if external sorting with count-merging of duplicates has been studied. I did find a 1983 paper (see below). Usually, sorting algorithms are designed and studied with the assumption of sorting objects by keys, so duplicate keys have different objects. There might be some existing literature on this, but it's a very interesting problem. Probably it's just considered an application of compact dictionaries combined with external merge-sorting.

Efficient dictionaries for storing large amounts of strings in little memory is a very well studied problem. Most of the useful data structures can include auxiliary data for each word (in our case, a dup count).


TL:DR summary of useful ideas, since I rambled on in way too much detail about many things in the main body of this answer:

  • Set batch boundaries when your dictionary size hits a threshold, not after a fixed number of input files. If there are a lot of duplicates in a group of 5000 strings, a fixed-count batch still won't be using very much memory. You can find way more duplicates in the first pass this way.

  • Sorted batches make merging much faster. You can and should merge many->one instead of binary merging. Use a PriorityQueue to figure out which input file has the line you should take next.

  • To avoid a burst of memory usage when sorting the keys in a hash table, use a dictionary that can do an in-order traversal of keys. (i.e. sort on the fly.) There's SortedDictionary<TKey, TValue> (binary tree-based). This also interleaves the CPU usage of sorting with the I/O waiting to get the input strings.

  • Radix-sort each batch into outputs by first-character (a-z, non-alphabetic that sorts before A, and non-alphabetic that sorts after z). Or some other bucketing choice that distributes your keys well. Use separate dictionaries for each radix bucket, and empty only the biggest into a batch when you hit your memory ceiling. (fancier eviction heuristics than "biggest" may be worth it.)

  • throttle I/O (esp. when merging), and check system CPU load and memory pressure. Adapt behaviour accordingly to make sure you don't cause an impact when the server is most busy.

  • For smaller temp files at the cost of CPU time, use a common-prefix encoding, or maybe lz4.

  • A space-efficient dictionary will allow larger batch sizes (and thus a larger duplicate-finding window) for the same upper memory bound. A Trie (or better, Radix Trie) might be ideal, because it stores the characters within the tree nodes, with common prefixes only stored once. Directed Acyclic Word Graphs are even more compact (finding redundancy between common substrings that aren't prefixes). Using one as a Dictionary is tricky but probably possible (see below).

  • Take advantage of the fact that you don't need to delete any tree nodes or strings until you're going to empty the whole dictionary. Use a growable array of nodes, and another growable char array that packs strings head to tail. (Useful for a Radix Trie (multi-char nodes), but not a regular Trie where each node is a single char.)

  • Depending on how the duplicates are distributed, you might or might not be able to find very many on the first pass. This has some implications, but doesn't really change how you end up merging.


I'm assuming you have some directory traversal idea in mind, which can efficiently supply your code with a stream of strings to be uniquified and counted. So I'll just say "strings" or "keys", to talk about the inputs.

Trim off as many unnecessary characters as possible (e.g. lose the .xml if they're all .xml).


It might be useful to do the CPU/memory intensive work on a separate machine, depending on what other hardware you have with a fast network connection to your critical production server.

You could run a simple program on the server that sends filenames over a TCP connection to a program running on another machine, where it's safe to use much more memory. The program on the server could still do small dictionary batches, and just store them on a remote filesystem.


And now, since none of the other answers really put all the pieces together, here's my actual answer:

An upper bound on memory usage is easy. Write your program to use a constant memory ceiling, regardless of input size. Bigger inputs will lead to more merging phases, not more memory usage at any point.

The best estimate of temporary file storage space you can do without looking at the input is a very conservative upper bound that assumes every input string is unique. You need some way to estimate how many input strings there will be. (Most filesystems know how many separate files they contain, without having to walk the directory tree and count them.)

You can make some assumptions about the distribution of duplicates to make a better guess.

If number, rather than size, of scratch files is an issue, you can store multiple batches in the same output file, one after another. Either put length-headers at the start of each to allow skipping forward by batch, or write byte offsets to a separate data stream. If size is also important, see my paragraph about using frcode-style common-prefix compression.


As Ian Mercer points out in his answer, sorting your batches will make merging them much more efficient. If you don't, you either risk hitting a wall where your algorithm can't make forward progress, or you need to do something like load one batch, scan another batch for entries that are in the first, and rewrite the 2nd batch with just the potentially-few matching entries removed.

Not sorting your batches makes the time complexity of the first pass O(N), but either you have to sort at some point later, or your later stages have a worst-case bound that's dramatically worse. You want your output globally sorted, so other than RadixSort approaches, there's no avoiding an O(N log N) somewhere.

With limited batch size, O(log N) merge steps are expected, so your original analysis missed the O(N log N) complexity of your approach by ignoring what needs to happen after the phase1 batches are written.


The appropriate design choices change a lot depending on whether our memory ceiling is big enough to find many duplicates within one batch. If even a complex compact data structure like a Trie doesn't help much, putting the data into a Trie and getting it out again to write a batch is a waste of CPU time.

If you can't do much duplicate-elimination within each batch anyway, then you need to optimize for putting possibly-matching keys together for the next stage. Your first stage could group input strings by first byte, into up to 252 or so output files (not all 256 values are legal filename characters), or into 27 or so output files (alphabet + misc), or 26+26+1 for upper/lower case + non-alphabetic. Temp files can omit the common prefix from each string.

Then most of these first stage batches should have a much higher duplicate density. Actually, this Radix distribution of inputs into output buckets is useful in any case, see below.

You should still sort your first-stage outputs in chunks, to give the next pass a much wider dup-find window for the same RAM.


I'm going to spend more time on the domain where you can find a useful amount of duplicates in the initial stream, before using up ~100MiB of RAM, or whatever you choose as an upper limit.

Obviously we add strings to some sort of dictionary to find and count duplicates on the fly, while only requiring enough storage for the set of unique strings. Just storing strings and then sorting them would be significantly less efficient, because we'd hit our RAM limit much sooner without on-the-fly duplicate detection.

To minimize the phase2 work, phase1 should find and count as many duplicates as possible, reducing the total size of the p2 data. Reducing the amount of merging work for phase2 is good, too. Bigger batches help with both factors, so it's very useful to come as close to your memory ceiling as you safely can in phase1. Instead of writing a batch after a constant number of input strings, do it when your memory consumption nears your chosen ceiling. Duplicates are counted and thrown away, and don't take any extra storage.

An alternative to accurate memory accounting is tracking the number of unique strings in your dictionary, which is easy (and done for you by the library implementation). Accumulating the length of strings added can give you a good estimate of memory used for storing the strings, too. Or just make an assumption about string length distribution. Make your hash table the right size initially so it doesn't have to grow while you add elements, and stop when it's 60% full (load factor) or something.


A space-efficient data structure for the dictionary increases our dup-finding window for a given memory limit. Hash tables get badly inefficient when their load factor is too high, but the hash table itself only has to store pointers to the strings. It's the most familiar dictionary and has library implementations.

We know we're going to want our batch sorted once we've seen enough unique keys, so it might make sense to use a dictionary that can be traversed in sorted order. Sorting on the fly makes sense because keys will come in slowly, limited by disk IO since we're reading from filesystem metadata. One downside is if most of the keys we see are duplicates, then we're doing a lot of O(log batchsize) lookups, rather than a lot of O(1) lookups. And it's more likely that a key will be a duplicate when the dictionary is big, so most of those O(log batchsize) queries will be with a batch size near max, not uniformly distributed between 0 and max. A tree pays the O(log n) overhead of sorting for every lookup, whether the key turned out to be unique or not. A hash table only pays the sorting cost at the end after removing duplicates. So for a tree it's O(total_keys * log unique_keys), hash table is O(unique_keys * log unique_keys) to sort a batch.

A hash table with max load factor set to 0.75 or something might be pretty dense, but having to sort the KeyValuePairs before writing out a batch probably puts a damper on using standard Dictionary. You don't need copies of the strings, but you'll probably end up copying all the pointers (refs) to scratch space for a non-in-place sort, and maybe also when getting them out of the hash table before sorting. (Or instead of just pointers, KeyValuePair, to avoid having to go back and look up each string in the hash table). If short spikes of big memory consumption are tolerable, and don't cause you to swap / page to disk, you could be fine. This is avoidable if you can do an in-place sort in the buffer used by the hash table, but I doubt that can happen with standard-library containers.

A constant trickle of CPU usage to maintain the sorted dictionary at the speed keys are available is probably better than infrequent bursts of CPU usage to sort all of a batch's keys, besides the burst of memory consumption.

The .NET standard library has SortedDictionary<TKey, TValue>, which the docs say is implemented with a binary tree. I didn't check if it has a rebalance function, or uses a red-black tree, to guarantee O(log n) worst case performance. I'm not sure how much memory overhead it would have. If this is a one-off task, then I'd absolutely recommend using this to implement it quickly and easily. And also for a first version of a more optimized design for repeated use. You'll probably find it's good enough, unless you can find a nice library implementation of Tries.
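To make the phase1 shape concrete, here is a minimal sketch built on SortedDictionary, flushing a sorted batch when an estimated memory footprint nears the ceiling. The 100 MiB ceiling, the per-entry overhead constant, and the batch file naming are illustrative assumptions, and the batch line format (key, '/' separator, hex count) is the one suggested near the end of this answer:

using System;
using System.Collections.Generic;
using System.IO;

class Phase1Batcher
{
    const long MemoryCeilingBytes = 100L * 1024 * 1024;   // illustrative ceiling, choose before running
    const long AssumedOverheadPerEntry = 64;              // rough guess: tree node + object headers per key

    readonly SortedDictionary<string, long> counts =
        new SortedDictionary<string, long>(StringComparer.Ordinal);
    long estimatedBytes;
    int batchNumber;

    public void Add(string key)
    {
        if (counts.TryGetValue(key, out long c))
        {
            counts[key] = c + 1;            // duplicate: counted and thrown away, no extra memory
        }
        else
        {
            counts[key] = 1;
            estimatedBytes += 2L * key.Length + AssumedOverheadPerEntry;  // UTF-16 chars + overhead estimate
            if (estimatedBytes >= MemoryCeilingBytes)
                FlushBatch();
        }
    }

    public void FlushBatch()
    {
        if (counts.Count == 0) return;
        // SortedDictionary enumerates in key order, so the batch file comes out already sorted.
        using (var writer = new StreamWriter($"batch{batchNumber++}.txt"))
        {
            foreach (var kv in counts)
                writer.WriteLine($"{kv.Key}/{kv.Value:x}");   // '/' separator, hex count (see the end of this answer)
        }
        counts.Clear();
        estimatedBytes = 0;
    }
}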


Data structures for memory-efficient sorted dictionaries

The more memory-efficient our dictionary is, the more dups we can find before having to write out a batch and delete the dictionary. Also, if it's a sorted dictionary, our batches can be larger even when they can't find duplicates.

A secondary impact of data structure choice is how much memory traffic we generate while running on the critical server. A sorted array (with O(log n) lookup time (binary search), and O(n) insert time (shuffle elements to make room)) would be compact. However, it wouldn't just be slow, it would saturate memory bandwidth with memmove a lot of the time. 100% CPU usage doing this would have a bigger impact on the server's performance than 100% CPU usage searching a binary tree. It doesn't know where to load the next node from until it's loaded the current node, so it can't pipeline memory requests. The branch mispredicts of comparisons in the tree search also help moderate consumption of the memory bandwidth that's shared by all cores. (That's right, some 100%-CPU-usage programs are worse than others!)

It's nice if emptying our dictionary doesn't leave memory fragmented when we empty it. Tree nodes will be constant size, though, so a bunch of scattered holes will be usable for future tree node allocations. However, if we have separate dictionaries for multiple radix buckets (see below), key strings associated with other dictionaries might be mixed in with tree nodes. This could lead to malloc having a hard time reusing all the freed memory, potentially increasing actual OS-visible memory usage by some small factor. (Unless C# runtime garbage collection does compaction, in which case fragmentation is taken care of.)

Since you never need to delete nodes until you want to empty the dictionary and delete them all, you could store your Tree nodes in a growable array. So memory management only has to keep track of one big allocation, reducing bookkeeping overhead compared to malloc of each node separately. Instead of real pointers, the left / right child pointers could be array indices. This lets you use only 16 or 24 bits for them. (A Heap is another kind of binary tree stored in an array, but it can't be used efficiently as a dictionary. It's a tree, but not a search tree).

Storing the string keys for a dictionary would normally be done with each String as a separately-allocated object, with pointers to them in an array. Since again, you never need to delete, grow, or even modify one until you're ready to delete them all, you can pack them head to tail in a char array, with a terminating zero-byte at the end of each one. This again saves a lot of book-keeping, and also makes it easy to keep track of how much memory is in use for the key strings, letting you safely come closer to your chosen memory upper bound.
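A sketch of the append-only storage this describes, with a hypothetical growable node array (child links as indices) and a char arena that packs the key strings head to tail; the per-node byte estimate is a rough assumption:

using System;
using System.Collections.Generic;

// Append-only storage for one batch: nothing is deleted until the whole batch is discarded.
class PackedTreeStorage
{
    // Fixed-size nodes in one growable array; child links are array indices (-1 = none),
    // so they could be narrowed to 16 or 24 bits if the node count allows it.
    struct Node
    {
        public int Left, Right;
        public int KeyOffset, KeyLength;  // where this node's key lives in the char arena
        public long Count;
    }

    readonly List<Node> nodes = new List<Node>();
    char[] arena = new char[1 << 16];     // key strings packed head to tail
    int arenaUsed;

    public int AddNode(string key, long count)
    {
        while (arenaUsed + key.Length > arena.Length)
            Array.Resize(ref arena, arena.Length * 2);
        key.CopyTo(0, arena, arenaUsed, key.Length);

        nodes.Add(new Node { Left = -1, Right = -1, KeyOffset = arenaUsed, KeyLength = key.Length, Count = count });
        arenaUsed += key.Length;
        return nodes.Count - 1;           // index of the new node, usable as a child link
    }

    public string KeyOf(int node) =>
        new string(arena, nodes[node].KeyOffset, nodes[node].KeyLength);

    // Tracking memory use is trivial: chars in the arena plus an assumed fixed size per node.
    public long EstimatedBytes => 2L * arenaUsed + 32L * nodes.Count;
}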

Trie / DAWG for even more compact storage

For even denser storage of a set of strings, we can eliminate the redundancy of storing all the characters of every string, since there are probably a lot of common prefixes.

A Trie stores the strings in the tree structure, giving you common-prefix compression. It can be traversed in sorted order, so it sorts on the fly. Each node has as many children as there are different next-characters in the set, so it's not a binary tree. A C# Trie partial implementation (delete not written) can be found in this SO answer, to a question similar to this but not requiring batching / external sorting.

Trie nodes need to store potentially many child pointers, so each node can be large. Or each node could be variable-size, holding the list of nextchar:ref pairs inside the node, if C# makes that possible. Or as the Wikipedia article says, a node can actually be a linked-list or binary search tree, to avoid wasting space in nodes with few children. (The lower levels of a tree will have a lot of that.) End-of-word markers / nodes are needed to distinguish between substrings that aren't separate dictionary entries, and ones that are. Our count field can serve that purpose. Count=0 means the substring ending here isn't in the dictionary. count >= 1 means it is.
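A minimal Trie sketch along these lines, using the count field as the end-of-word marker. This is an illustration only (a plain per-character Trie, not the Radix Trie or the linked SO implementation), and it uses a SortedDictionary per node so a depth-first walk emits keys in sorted order:

using System.Collections.Generic;

class TrieNode
{
    // Sorted children so a depth-first walk emits keys in sorted order.
    public SortedDictionary<char, TrieNode> Children = new SortedDictionary<char, TrieNode>();
    public long Count;   // 0 = string ending here is not an entry; >= 1 = it is, with this many occurrences
}

class Trie
{
    readonly TrieNode root = new TrieNode();

    public void AddOrIncrement(string key)
    {
        var node = root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.Count++;                      // duplicates just bump the count; no new storage
    }

    // Common prefixes are stored once; traversal yields (key, count) pairs in sorted order.
    public IEnumerable<(string Key, long Count)> Walk() => Walk(root, "");

    static IEnumerable<(string Key, long Count)> Walk(TrieNode node, string prefix)
    {
        if (node.Count > 0) yield return (prefix, node.Count);
        foreach (var kv in node.Children)
            foreach (var item in Walk(kv.Value, prefix + kv.Key))
                yield return item;
    }
}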

A more compact Trie is the Radix Tree, or PATRICIA Tree, which stores multiple characters per node.

Another extension of this idea is the Deterministic acyclic finite state automaton (DAFSA), sometimes called a Directed Acyclic Word Graph (DAWG), but note that the DAWG wikipedia article is about a different thing with the same name. I'm not sure a DAWG can be traversed in sorted order to get all the keys out at the end, and as wikipedia points out, storing associated data (like a duplicate count) requires a modification. I'm also not sure they can be built incrementally, but I think you can do lookups without having compacted. The newly added entries will be stored like a Trie, until a compaction step every 128 new keys merges them into the DAWG. (Or run the compaction less frequently for bigger DAWGs, so you aren't doing it too much, like doubling the size of a hash table when it has to grow, instead of growing linearly, to amortize the expensive op.)

You can make a DAWG more compact by storing multiple characters in a single node when there isn't any branching / converging. This page also mentions a Huffman-coding approach to compact DAWGs, and has some other links and article citations.

JohnPaul Adamovsky's DAWG implementation (in C) looks good, and describes some optimizations it uses. I haven't looked carefully to see if it can map strings to counts. It's optimized to store all the nodes in an array.

This answer to the dup-count words in 1TB of text question suggests DAWGs, and has a couple links, but I'm not sure how useful it is.


Writing batches: Radix on first character

You could get your RadixSort on, and keep separate dictionaries for every starting character (or for a-z, non-alphabetic that sorts before a, non-alphabetic that sorts after z). Each dictionary writes out to a different temp file. If you have multiple compute nodes available for a MapReduce approach, this would be the way to distribute merging work to the compute nodes.

This allows an interesting modification: instead of writing all radix buckets at once, only write the largest dictionary as a batch. This prevents tiny batches from going into some buckets every time you hit the memory ceiling. This will reduce the width of the merging within each bucket, speeding up phase2.

With a binary tree, this reduces the depth of each tree by about log2(num_buckets), speeding up lookups. With a Trie, this is redundant (each node uses the next character as a radix to order the child trees). With a DAWG, this actually hurts your space-efficiency because you lose out on finding the redundancy between strings with different starts but later shared parts.

This has the potential to behave poorly if there are a few infrequently-touched buckets that keep growing, but don't usually end up being the largest. They could use up a big fraction of your total memory, making for small batches from the commonly-used buckets. You could implement a smarter eviction algorithm that records when a bucket (dictionary) was last emptied. The NeedsEmptying score for a bucket would be something like a product of size and age. Or maybe some function of age, like sqrt(age). Some way to record how many duplicates each bucket has found since last emptied would be useful, too. If you're in a spot in your input stream where there are a lot of repeats for one of the buckets, the last thing you want to do is empty it frequently. Maybe every time you find a duplicate in a bucket, increment a counter. Look at the ratio of age vs. dups-found. Low-use buckets sitting there taking RAM away from other buckets will be easy to find that way, when their size starts to creep up. Really-valuable buckets might be kept even when they're the current biggest, if they're finding a lot of duplicates.

If your data structures for tracking age and dups found are a struct-of-arrays, the (current_pos - last_emptied[bucket]) / (float)dups_found[bucket] division can be done efficiently with vector floating point. One integer division is slower than one FP division. One FP division is the same speed as 4 FP divisions, and compilers can hopefully auto-vectorize if you make it easy for them like this.
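A sketch of that struct-of-arrays scoring with System.Numerics.Vector<float>; the age-per-dup score is the illustrative formula from above, and a bucket that has never found a duplicate scores +Inf, which conveniently marks it as a top eviction candidate:

using System.Numerics;

static class BucketScores
{
    // Struct-of-arrays layout: one float per bucket.
    // score[b] = (current_pos - last_emptied[b]) / dups_found[b]; high score => stale, low-value bucket.
    public static void ComputeScores(float currentPos, float[] lastEmptied, float[] dupsFound, float[] scores)
    {
        int width = Vector<float>.Count;
        var current = new Vector<float>(currentPos);
        int i = 0;
        for (; i <= lastEmptied.Length - width; i += width)
        {
            var age = current - new Vector<float>(lastEmptied, i);
            var score = age / new Vector<float>(dupsFound, i);   // one vector FP divide covers `width` buckets
            score.CopyTo(scores, i);
        }
        for (; i < lastEmptied.Length; i++)                      // scalar tail for the leftover buckets
            scores[i] = (currentPos - lastEmptied[i]) / dupsFound[i];
    }
}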

There's a lot of work to do between buckets filling up, so division would be a tiny hiccup unless you use a lot of buckets.

Choosing how to bucket

With a good eviction algorithm, an ideal choice of bucketing will put keys that rarely have duplicates together in some buckets, and keys that have many duplicates together in other buckets. If you're aware of any patterns in your data, this would be a way to exploit it. Having some buckets that are mostly low-dup means that all those unique keys don't wash away the valuable keys into an output batch. An eviction algorithm that looks at how valuable a bucket has been in terms of dups found per unique key will automatically figure out which buckets are valuable and worth keeping, even though their size is creeping up.

There are many ways to radix your strings into buckets. Some will ensure that every element in a bucket compares less than every element in every later bucket, so producing fully-sorted output is easy. Some won't, but have other advantages. There are going to be tradeoffs between bucketing choices, all of which are data-dependent:

  • good at finding a lot of duplicates in the first pass (e.g. by separating high-dup patterns from low-dup patterns)
  • distributes the number of batches uniformly between buckets (so no bucket has a huge number of batches requiring a multi-stage merge in phase2), and maybe other factors.
  • whether it produces bad behaviour when combined with your eviction algorithm on your data set.
  • amount of between-bucket merging needed to produce globally-sorted output. The importance of this scales with the total number of unique strings, not the number of input strings.

I'm sure clever people have thought about good ways to bucket strings before me, so this is probably worth searching on if the obvious approach of by-first-character isn't ideal. This special use-case (of sorting while eliminating/counting duplicates) is not typical. I think most work on sorting only considers sorts that preserve duplicates. So you might not find much that helps choose a good bucketing algorithm for a dup-counting external sort. In any case, it will be data-dependent.

Some concrete options for bucketing are: Radix = first two bytes together (still combining upper/lowercase, and combining non-alphabetic characters). Or Radix = the first byte of the hash code. (Requires a global merge to produce sorted output.) Or Radix = ((str[0]>>2) << 6) + (str[1]>>2), i.e. ignore the low 2 bits of the first 2 chars, to put [abcd][abcd].* together, [abcd][efgh].* together, etc. This would also require some merging of the sorted results between some sets of buckets. e.g. daxxx would be in the first bucket, but aexxx would be in the 2nd. But only buckets with the same first-char high-bits need to be merged with each other to produce the sorted final output.
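The concrete options above as small functions, parenthesized explicitly; the case/non-alphabetic folding here collapses everything non-alphabetic into one class (27*27 buckets) rather than the 28-way split discussed below, so treat the details as assumptions:

static class Bucketing
{
    // Radix = first two chars, case-folded, non-alphabetic collapsed to one class => 27*27 buckets.
    public static int FirstTwoChars(string s)
    {
        int a = Fold(s.Length > 0 ? s[0] : '\0');
        int b = Fold(s.Length > 1 ? s[1] : '\0');
        return a * 27 + b;
    }
    static int Fold(char c) => char.IsLetter(c) ? char.ToLowerInvariant(c) - 'a' + 1 : 0;

    // Radix = first byte of the hash code (needs a global merge afterwards to get sorted output).
    public static int HashByte(string s) => s.GetHashCode() & 0xFF;

    // Radix = ((str[0]>>2) << 6) + (str[1]>>2): ignore the low 2 bits of the first two chars.
    public static int TopBitsOfFirstTwo(string s)
    {
        int c0 = s.Length > 0 ? s[0] : 0;
        int c1 = s.Length > 1 ? s[1] : 0;
        return ((c0 >> 2) << 6) + (c1 >> 2);
    }
}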

An idea for handling a bucketing choice that gives great dup-finding but needs merge-sorting between buckets: When writing the phase2 output, bucket it with the first character as the radix to produce the sort order you want. Each phase1 bucket scatters output into phase2 buckets as part of the global sort. Once all the phase1 batches that can include strings starting with a have been processed, do the merge of the a phase2-bucket into the final output and delete those temp files.

Radix = first 2 bytes (combining non-alphabetic) would make for 28² = 784 buckets. With 200MiB of RAM, that's an average output file size of only ~256k. Emptying just one bucket at a time would make that the minimum, and you'd usually get larger batches, so this could work. (Your eviction algorithm could hit a pathological case that made it keep a lot of big buckets, and write a series of tiny batches for new buckets. There are dangers to clever heuristics if you don't test carefully).

Multiple batches packed into the same output file is probably most useful with many small buckets. You'll have e.g. 784 output files, each containing a series of batches. Hopefully your filesystem has enough contiguous free space, and is smart enough, to do a good job of not fragmenting too badly when scattering small-ish writes to many files.


Merging:

In the merging stages, with sorted batches we don't need a dictionary. Just take the next line from the batch that has the lowest key, combining duplicates as you find them.

MergeSort typically merges pairs, but when doing external sorting (i.e. disk -> disk), a much wider input is common to avoid reading and re-writing the output a lot of times. Having 25 input files open to merge into one output file should be fine. Use the library implementation of PriorityQueue (typically implemented as a Heap) to choose the next input element from many sorted lists. Maybe add input lines with the string as the priority, and the count and input file number as payload.
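A sketch of that many-way merge using .NET's PriorityQueue (available in .NET 6+), assuming each batch file holds unique sorted lines in the key/hexcount format used by the phase1 sketch above:

using System;
using System.Collections.Generic;
using System.IO;

static class KWayMerge
{
    // Merge up to ~25 sorted batch files into one sorted output, combining duplicate keys.
    public static void Merge(IReadOnlyList<string> inputPaths, string outputPath)
    {
        var readers = new StreamReader[inputPaths.Count];
        // Element = (key, count, file index); priority = key, so Dequeue yields the smallest key first.
        var queue = new PriorityQueue<(string Key, long Count, int File), string>(StringComparer.Ordinal);

        for (int i = 0; i < inputPaths.Count; i++)
        {
            readers[i] = new StreamReader(inputPaths[i]);
            EnqueueNext(queue, readers[i], i);
        }

        using var writer = new StreamWriter(outputPath);
        string currentKey = null;
        long currentCount = 0;

        while (queue.TryDequeue(out var item, out _))
        {
            if (item.Key == currentKey)
            {
                currentCount += item.Count;                    // same key from another batch: combine counts
            }
            else
            {
                if (currentKey != null) writer.WriteLine($"{currentKey}/{currentCount:x}");
                currentKey = item.Key;
                currentCount = item.Count;
            }
            EnqueueNext(queue, readers[item.File], item.File); // refill from the file we just consumed
        }
        if (currentKey != null) writer.WriteLine($"{currentKey}/{currentCount:x}");

        foreach (var r in readers) r.Dispose();
    }

    static void EnqueueNext(PriorityQueue<(string, long, int), string> queue, StreamReader reader, int file)
    {
        string line = reader.ReadLine();
        if (line == null) return;                              // this input is exhausted
        int sep = line.LastIndexOf('/');                       // '/' separator, hex count (see below)
        string key = line.Substring(0, sep);
        long count = Convert.ToInt64(line.Substring(sep + 1), 16);
        queue.Enqueue((key, count, file), key);
    }
}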

If you used radix distribute-by-first-character in the first pass, then merge all the a batches into the final output file (even if this process takes multiple merging stages), then all the b batches, etc. You don't need to check any of the batches from the starts-with-a bucket against batches from any other bucket, so this saves a lot of merging work, esp. if your keys are well distributed by first character.


Minimizing impact on the production server:

Throttle disk I/O during merging, to avoid bringing your server to its knees if disk prefetch generates a huge I/O queue depth of reads. Throttling your I/O, rather than a narrower merge, is probably a better choice. If the server is busy with its normal job, it prob. won't be doing many big sequential reads even if you're only reading a couple files.
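One simple way to throttle read bandwidth during the merge is a self-imposed MB/s budget; the target rate here is an assumption, and the load/memory checks described below would adjust or pause it:

using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

class ThrottledReader
{
    readonly Stopwatch clock = Stopwatch.StartNew();
    readonly double bytesPerSecondTarget;
    long bytesRead;

    public ThrottledReader(double megabytesPerSecond) =>
        bytesPerSecondTarget = megabytesPerSecond * 1024 * 1024;

    // Wrap each chunk read; sleep whenever we're ahead of the allowed rate.
    public int Read(Stream source, byte[] buffer)
    {
        int n = source.Read(buffer, 0, buffer.Length);
        bytesRead += n;
        double earliestAllowedSeconds = bytesRead / bytesPerSecondTarget;   // when this much data was "due"
        double ahead = earliestAllowedSeconds - clock.Elapsed.TotalSeconds;
        if (ahead > 0)
            Thread.Sleep(TimeSpan.FromSeconds(ahead));                      // ahead of budget: back off
        return n;
    }
}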

Check the system load occasionally while running. If it's high, sleep for 1 sec before doing some more work and checking again. If it's really high, don't do any more work until the load average drops (sleeping 30sec between checks).

Check the system memory usage, too, and reduce your batch threshold if memory is tight on the production server. (Or if really tight, flush your partial batch and sleep until memory pressure reduces.)

If temp-file size is an issue, you could do common-prefix compression like frcode from updatedb/locate to significantly reduce the file size for sorted lists of strings. Probably use case-sensitive sorting within a batch, but case-insensitive radixing. So each batch in the a bucket will have all the As, then all the as. Or even LZ4 compress / decompress them on the fly. Use hex for the counts, not decimal. It's shorter, and faster to encode/decode.
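A minimal sketch of frcode-style front coding for one sorted batch: each line stores how many leading characters it shares with the previous line, plus the differing suffix. The line format here is an assumption for illustration (not the real frcode on-disk format), and it uses the '/' key/count separator discussed just below:

using System;
using System.Collections.Generic;
using System.IO;

static class FrontCoding
{
    // Encode a sorted sequence of (key, count) pairs as: <sharedPrefixLen> ' ' <suffix> '/' <hexCount>
    public static void Encode(IEnumerable<(string Key, long Count)> sorted, TextWriter output)
    {
        string previous = "";
        foreach (var (key, count) in sorted)
        {
            int shared = 0;
            int max = Math.Min(previous.Length, key.Length);
            while (shared < max && previous[shared] == key[shared]) shared++;
            output.WriteLine($"{shared} {key.Substring(shared)}/{count:x}");
            previous = key;
        }
    }

    public static IEnumerable<(string Key, long Count)> Decode(TextReader input)
    {
        string previous = "";
        string line;
        while ((line = input.ReadLine()) != null)
        {
            int space = line.IndexOf(' ');
            int slash = line.LastIndexOf('/');                  // safe: '/' can't appear in a filename key
            int shared = int.Parse(line.Substring(0, space));
            string key = previous.Substring(0, shared) + line.Substring(space + 1, slash - space - 1);
            long count = Convert.ToInt64(line.Substring(slash + 1), 16);
            yield return (key, count);
            previous = key;
        }
    }
}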

Use a separator that's not a legal filename character, like /, between key and count. String parsing might well take up a lot of the CPU time in the merge stage, so it's worth considering. If you can leave strings in per-file input buffers, and just point your PQueue at them, that might be good. (And tell you which input file a string came from, without storing that separately.)


Performance tuning:

If the initial unsorted strings were available extremely fast, then a hash table with small batches that fit the dictionary in the CPU L3 cache might be a win, unless a larger window can include a much larger fraction of keys, and find more dups. It depends on how many repeats are typical in say 100k files. Build small sorted batches in RAM as you read, then merge them to a disk batch. This may be more efficient than doing a big in-memory quicksort, since you don't have random access to the input until you've initially read it.

Since I/O will probably be the limit, large batches that don't fit in the CPU's data cache are probably a win, to find more duplicates and (greatly?) reduce the amount of merging work to be done.

It might be convenient to check the hash table size / memory consumption after every chunk of filenames you get from the OS, or after every subdirectory or whatever. As long as you choose a conservative size bound, and you make sure you can't go for too long without checking, you don't need to go nuts checking every iteration.


This paper from 1983 examines external merge-sorting eliminating duplicates as they're encountered, and also suggests duplicate elimination with a hash function and a bitmap. With long input strings, storing MD5 or SHA1 hashes for duplicate-elimination saves a lot of space.

I'm not sure what they had in mind with their bitmap idea. Being collision-resistant enough to be usable without going back to check the original string would require a hash code of too many bits to index a reasonable-size bitmap. (e.g. MD5 is a 128bit hash).
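For completeness, a hedged sketch of the hash-for-space idea mentioned above: count by 128-bit MD5 digest instead of by the full (possibly long) string. This is fine for counting, but you'd need the original strings again, or a second pass, to emit sorted names, so it only helps when keys are long:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class HashedCounting
{
    // Count occurrences keyed by a 16-byte MD5 digest instead of the full string.
    public static Dictionary<Guid, long> CountByDigest(IEnumerable<string> keys)
    {
        var counts = new Dictionary<Guid, long>();
        using var md5 = MD5.Create();
        foreach (var key in keys)
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            var compactKey = new Guid(digest);        // reuse Guid as a handy 128-bit value type
            counts.TryGetValue(compactKey, out long c);
            counts[compactKey] = c + 1;
        }
        return counts;
    }
}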
