RegEx, StringBuilder and Large Object Heap Fragmentation

Question

How can I run lots of RegExes (to find matches) in big strings without causing LOH fragmentation?

It's .NET Framework 4.0, so I'm using StringBuilder to keep the data out of the LOH. However, as soon as I need to run a RegEx on it I have to call StringBuilder.ToString(), which means the resulting string will be in the LOH.
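
To see where a given string ends up, a rough check like the one below can help. It is only a sketch: LohCheck and IsFreshAllocationOnLoh are hypothetical names, and the check relies on the assumption that an object which reports the maximum generation immediately after allocation was placed on the LOH (small objects start in generation 0).

```csharp
using System;

public static class LohCheck
{
    // Heuristic (assumption): an object that already reports GC.MaxGeneration
    // right after it was allocated went straight to the Large Object Heap.
    // Only meaningful for freshly allocated objects, because small objects
    // are promoted to higher generations when they survive collections.
    public static bool IsFreshAllocationOnLoh(object freshlyAllocated)
    {
        return GC.GetGeneration(freshlyAllocated) == GC.MaxGeneration;
    }
}
```

For example, LohCheck.IsFreshAllocationOnLoh(new string('x', 100000)) checks a string of about 200KB of character data, which is well above the 85KB limit mentioned in the question.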

Is there any solution to this problem? It's virtually impossible to have a long-running application that deals with big strings and RegExes like this.

An idea to solve this problem:

While thinking about this problem, I think I found a dirty solution.

At any given time I only have 5 strings, and these 5 strings (bigger than 85KB) will be passed to RegEx.Match.

Since the fragmentation occurs because new objects won't fit into the free spaces in the LOH, this should solve the problem:

  1. PadRight all strings to a maximum accepted size, let's say 1024KB (I might need to do this with StringBuilder)
  2. By doing so, every new string will fit into memory that has already been freed, because the previous string is already out of scope
  3. There won't be any fragmentation because the object size is always the same; hence I'll only allocate 1024KB * 5 at any given time, and this space in the LOH will be shared between these strings.
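
The padding idea in the list above could be sketched roughly like this. It is an illustration under assumptions, not a tested solution: FixedSizeStrings, ToPaddedString and FixedLength are hypothetical names, and 512 * 1024 chars is chosen because a .NET char takes 2 bytes, which gives about 1024KB of character data per string.

```csharp
using System;
using System.Text;

public static class FixedSizeStrings
{
    // Hypothetical fixed size: 512K chars * 2 bytes per char = about 1024KB.
    public const int FixedLength = 512 * 1024;

    // Returns the builder's content padded with spaces to exactly FixedLength
    // chars, so that every string produced this way has the same size.
    public static string ToPaddedString(StringBuilder sb)
    {
        if (sb.Length > FixedLength)
        {
            throw new ArgumentException("Content exceeds the fixed size.");
        }

        int originalLength = sb.Length;
        sb.Append(' ', FixedLength - originalLength); // pad in place
        string padded = sb.ToString();                // one allocation of fixed size
        sb.Length = originalLength;                   // drop the padding again
        return padded;
    }
}
```

Note that the trailing spaces become part of the input, so patterns that depend on the end of the string (for example `$` or trailing `\s*`) may behave differently than on the unpadded text.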

I suppose the biggest problem with this design is what happens if other big objects get allocated in these locations in the LOH. That would cause the application to allocate lots of 1024KB strings, maybe with even worse fragmentation. A fixed statement might help; however, how can I send a fixed string to RegEx without actually creating a new string that is not located at a fixed memory address?

Any ideas about this theory? (Unfortunately I can't reproduce the problem easily. I'm generally trying to use a memory profiler to observe the changes, and I'm not sure what kind of isolated test case I can write for this.)

Solution

OK, here is my attempt to solve this problem in a fairly generic way, but with some obvious limitations. Since I haven't seen this advice anywhere and everyone is whining about LOH fragmentation, I wanted to share the code to confirm that my design and assumptions are correct.

Theory:

  1. Create a shared massive StringBuilder (this is to store the big strings that we read from streams) - new StringBuilder(ChunkSize * 5);
  2. Create a massive String (it has to be bigger than the max. accepted size), initialized with spaces - new string(' ', ChunkSize * 10);
  3. Pin the string object in memory so the GC will not mess with it: GCHandle.Alloc(pinnedText, GCHandleType.Pinned). Even though LOH objects are normally not moved (the LOH is not compacted in .NET 4.0), this seems to improve the performance. Maybe because of the unsafe code
  4. Read the stream into the shared StringBuilder and then unsafe-copy it into pinnedText by using indexers
  5. Pass pinnedText to RegEx

With this implementation the code below works as if there were no LOH allocation. If I switch to new string(' ') allocations instead of using a static StringBuilder, or use StringBuilder.ToString(), the code can allocate 300% less memory before crashing with an OutOfMemoryException.

I also confirmed the results with a memory profiler: there is no LOH fragmentation in this implementation. I still don't understand why RegEx doesn't cause any unexpected problems. I also tested with different and expensive RegEx patterns and the results are the same: no fragmentation.
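
One thing to keep in mind with a fixed-size buffer is that pinnedText.Length never changes, so a pattern is matched against the whole buffer, including whatever lies beyond the copied data. A possible way to limit the search, sketched here as an assumption and not part of my original code, is the Regex.Match(string, int, int) overload, which only searches the given range. BoundedMatch and MatchPrefix are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundedMatch
{
    // Searches only the first validLength chars of buffer, without
    // allocating a substring. Anything after validLength is ignored.
    public static Match MatchPrefix(Regex rgx, string buffer, int validLength)
    {
        if (validLength < 0 || validLength > buffer.Length)
        {
            throw new ArgumentOutOfRangeException("validLength");
        }

        return rgx.Match(buffer, 0, validLength);
    }
}
```

In the test program this would mean keeping track of _sb.Length after the copy and passing it as validLength instead of relying on the '\0' terminator, which RegEx treats as an ordinary character.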

Code:

http://pastebin.com/ZuuBUXk3

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;
using System.Text.RegularExpressions;

namespace LOH_RegEx
{
    internal class Program
    {
        private static List<string> storage = new List<string>();
        private const int ChunkSize = 100000;
        private static StringBuilder _sb = new StringBuilder(ChunkSize * 5);


        private static void Main(string[] args)
        {
            var pinnedText = new string(' ', ChunkSize * 10);
            var sourceCodePin = GCHandle.Alloc(pinnedText, GCHandleType.Pinned);

            var rgx = new Regex("A", RegexOptions.CultureInvariant | RegexOptions.Compiled);

            try
            {

                for (var i = 0; i < 30000; i++)
                {                   
                    //Simulate that we read data from stream to SB
                    UpdateSB(i);
                    CopyInto(pinnedText);                   
                    var rgxMatch = rgx.Match(pinnedText);

                    if (!rgxMatch.Success)
                    {
                        Console.WriteLine("RegEx failed!");
                        Console.ReadLine();
                    }

                    //Extra buffer to fragment LoH
                    storage.Add(new string('z', 50000));
                    if ((i%100) == 0)
                    {
                        Console.Write(i + ",");
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                Console.WriteLine("OOM Crash!");
                Console.ReadLine();
            }
            finally
            {
                // Release the pin so the handle does not leak.
                sourceCodePin.Free();
            }
        }


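        // Overwrites the characters of the pinned string in place (compile with /unsafe).
        // The string's Length never changes, so RegEx still sees the whole buffer.
        private static unsafe void CopyInto(string text)
        {
            if (_sb.Length >= text.Length)
            {
                throw new ArgumentException("Pinned buffer is too small.");
            }

            fixed (char* pChar = text)
            {
                int i;
                for (i = 0; i < _sb.Length; i++)
                {
                    pChar[i] = _sb[i];
                }

                // i == _sb.Length here, so terminate directly after the last copied char
                // (writing to i + 1 would leave one stale char in between).
                pChar[i] = '\0';
            }
        }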

        private static void UpdateSB(int extraSize)
        {
            _sb.Clear(); // same effect as Remove(0, _sb.Length); keeps the capacity

            var rnd = new Random();
            for (var i = 0; i < ChunkSize + extraSize; i++)
            {
                _sb.Append((char)rnd.Next(60, 80));
            }
        }
    }
}
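
The character-by-character copy through the StringBuilder indexer can be slow for large builders. A block-wise variant might look like the sketch below. It is an assumption-laden alternative, not what the code above does: PinnedCopy, CopyInto and the scratch buffer are hypothetical, it relies on GCHandle.AddrOfPinnedObject returning the address of the first character of a pinned string, and it still mutates a string in place, which is a hack that is only tolerable for a privately owned buffer.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class PinnedCopy
{
    // Copies the content of sb into the pinned string behind pin, one block at
    // a time via a small reusable scratch array. Returns the number of chars copied.
    public static int CopyInto(StringBuilder sb, GCHandle pin, int targetLength, char[] scratch)
    {
        if (sb.Length > targetLength)
        {
            throw new ArgumentException("StringBuilder content does not fit.");
        }

        IntPtr dest = pin.AddrOfPinnedObject(); // assumed: first char of the string
        int copied = 0;
        while (copied < sb.Length)
        {
            int n = Math.Min(scratch.Length, sb.Length - copied);
            sb.CopyTo(copied, scratch, 0, n);
            // Offset is in bytes, so multiply the char index by sizeof(char).
            Marshal.Copy(scratch, 0, IntPtr.Add(dest, copied * sizeof(char)), n);
            copied += n;
        }

        return copied;
    }
}
```

A scratch array of a few thousand chars stays well below the LOH limit, so it can be allocated once and reused for every copy.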
