Has anyone implemented a Regex and/or Xml parser around StringBuilders or Streams?

Question

I'm building a stress-testing client that hammers servers and analyzes responses using as many threads as the client can muster. I'm constantly finding myself throttled by garbage collection (and/or lack thereof), and in most cases, it comes down to strings that I'm instantiating only to pass them off to a Regex or an Xml parsing routine.

If you decompile the Regex class, you'll see that internally, it uses StringBuilders to do nearly everything, but you can't pass it a string builder; it helpfully dives down into private methods before starting to use them, so extension methods aren't going to solve it either. You're in a similar situation if you want to get an object graph out of the parser in System.Xml.Linq.

This is not a case of pedantic over-optimization-in-advance. I've looked at the Regex replacements inside a StringBuilder question and others. I've also profiled my app to see where the ceilings are coming from, and using Regex.Replace() now is indeed introducing significant overhead in a method chain where I'm trying to hit a server with millions of requests per hour and examine XML responses for errors and embedded diagnostic codes. I've already gotten rid of just about every other inefficiency that's throttling the throughput, and I've even cut a lot of the Regex overhead out by extending StringBuilder to do wildcard find/replace when I don't need capture groups or backreferences, but it seems to me that someone would have wrapped up a custom StringBuilder (or better yet, Stream) based Regex and Xml parsing utility by now.

Ok, so rant over, but am I going to have to do this myself?

Update: I found a workaround which lowered peak memory consumption from multiple gigabytes to a few hundred megs, so I'm posting it below. I'm not adding it as an answer because a) I generally hate to do that, and b) I still want to find out if someone takes the time to customize StringBuilder to do Regexes (or vice-versa) before I do.

In my case, I could not use XmlReader because the stream I am ingesting contains some invalid binary content in certain elements. In order to parse the XML, I have to empty out those elements. I was previously using a single static compiled Regex instance to do the replace, and this consumed memory like mad (I'm trying to process ~300 10KB docs/sec). The change that drastically reduced consumption was:

  1. I added the code from this StringBuilder Extensions article on CodeProject for the handy IndexOf method.
  2. I added a (very) crude WildcardReplace method that allows one wildcard character (* or ?) per invocation
  3. I replaced the Regex usage with a WildcardReplace() call to empty the contents of the offending elements

This is very unpretty and tested only as far as my own purposes required; I would have made it more elegant and powerful, but YAGNI and all that, and I'm in a hurry. Here's the code:

/// <summary>
/// Performs basic wildcard find and replace on a string builder, observing one of two 
/// wildcard characters: * matches any number of characters, or ? matches a single character.
/// Operates on only one wildcard per invocation; 2 or more wildcards in <paramref name="find"/>
/// will cause an exception.
/// All characters in <paramref name="replaceWith"/> are treated as literal parts of 
/// the replacement text.
/// </summary>
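/// <param name="sb">The builder to modify in place.</param>
/// <param name="find">The pattern to search for; must contain exactly one wildcard, with literal text on both sides of it.</param>
/// <param name="replaceWith">The literal text that replaces each match.</param>
/// <returns>The same <see cref="StringBuilder"/> instance, to allow call chaining.</returns>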
public static StringBuilder WildcardReplace(this StringBuilder sb, string find, string replaceWith) {
    if (find.Split(new char[] { '*' }).Length > 2 || find.Split(new char[] { '?' }).Length > 2 || (find.Contains("*") && find.Contains("?"))) {
        throw new ArgumentException("Only one wildcard is supported, but more than one was supplied.", "find");
    } 
    // are we matching one character, or any number?
    bool matchOneCharacter = find.Contains("?");
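    string[] parts = matchOneCharacter ? 
        find.Split(new char[] { '?' }, StringSplitOptions.RemoveEmptyEntries) 
        : find.Split(new char[] { '*' }, StringSplitOptions.RemoveEmptyEntries);
    // The loop below indexes parts[0] and parts[1], so the wildcard must sit between two literal parts.
    if (parts.Length != 2) {
        throw new ArgumentException("Exactly one wildcard with literal text on both sides is required.", "find");
    }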
    int startItemIdx; 
    int endItemIdx;
    int newStartIdx = 0;
    int length;
    // IndexOf returns -1 when nothing is found, so compare with >= 0; a match at index 0 is valid.
    while ((startItemIdx = sb.IndexOf(parts[0], newStartIdx)) >= 0 
        && (endItemIdx = sb.IndexOf(parts[1], startItemIdx + parts[0].Length)) >= 0) {
        length = (endItemIdx + parts[1].Length) - startItemIdx;
        newStartIdx = startItemIdx + replaceWith.Length;
        // With "?" wildcard, find parameter length should equal the length of its match:
        if (matchOneCharacter && length > find.Length)
            break;
        sb.Remove(startItemIdx, length);
        sb.Insert(startItemIdx, replaceWith);
    }
    return sb;
}
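
The method above calls an `IndexOf(string, int)` extension on `StringBuilder` that comes from the CodeProject article and is not reproduced here. As a rough stand-in, the sketch below shows one way such an extension could look. The class name and the naive ordinal scan are my own assumptions, not the article's code.

```csharp
using System;
using System.Text;

public static class StringBuilderSearchExtensions
{
    // Hypothetical stand-in for the CodeProject IndexOf extension.
    // Returns the index of the first ordinal occurrence of value at or
    // after startIndex, or -1 when there is none.
    public static int IndexOf(this StringBuilder sb, string value, int startIndex = 0)
    {
        if (sb == null) throw new ArgumentNullException(nameof(sb));
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (startIndex < 0) throw new ArgumentOutOfRangeException(nameof(startIndex));
        if (value.Length == 0) return startIndex <= sb.Length ? startIndex : -1;

        // Last position at which value could still fit inside the builder.
        int last = sb.Length - value.Length;
        for (int i = startIndex; i <= last; i++)
        {
            int j = 0;
            while (j < value.Length && sb[i + j] == value[j]) j++;
            if (j == value.Length) return i;
        }
        return -1;
    }
}
```

With this in place, `new StringBuilder("<a><blob>xx</blob></a>").IndexOf("<blob>")` returns 3, and a missing value returns -1. Indexer access on a large, multi-chunk `StringBuilder` can be slow, so a production version would probably want a smarter scan.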

Recommended answer

Here, try this. Everything's char based and relatively low level for efficiency. Any number of your *s or ?s can be used. However, your * is now ✪ and your ? is now ★. Around three days of work went into this to make it as clean as possible. You can even enter multiple queries on one sweep!

Example usage: wildcard(new StringBuilder("Hello and welcome"), "hello✪w★l", "be") results in "become".

////////////////////////////////////////////////////////////////////////////////////////////////////////
///////////// Search for a string/s inside 'text' using the 'find' parameter, and replace with a string/s using the replace parameter
// ✪ represents multiple wildcard characters (non-greedy)
// ★ represents a single wildcard character
public StringBuilder wildcard(StringBuilder text, string find, string replace, bool caseSensitive = false)
{
    return wildcard(text, new string[] { find }, new string[] { replace }, caseSensitive);
}
public StringBuilder wildcard(StringBuilder text, string[] find, string[] replace, bool caseSensitive = false)
{
    if (text.Length == 0) return text;          // Degenerate case

    StringBuilder sb = new StringBuilder();     // The new adjusted string with replacements
    for (int i = 0; i < text.Length; i++)   {   // Go through every letter of the original large text

        bool foundMatch = false;                // Assume match hasn't been found to begin with
        for(int q=0; q< find.Length; q++) {     // Go through each query in turn
            if (find[q].Length == 0) continue;  // Ignore empty queries

            int f = 0;  int g = 0;              // Query cursor and text cursor
            bool multiWild = false;             // multiWild is ✪ symbol which represents many wildcard characters
            int multiWildPosition = 0;          

            while(true) {                       // Loop through query characters
                if (f >= find[q].Length || (i + g) >= text.Length) break;       // Bounds checking
                char cf = find[q][f];                                           // Character in the query (f is the offset)
                char cg = text[i + g];                                          // Character in the text (g is the offset)
                if (!caseSensitive) { cg = char.ToLowerInvariant(cg); cf = char.ToLowerInvariant(cf); }   // Fold both sides so mixed-case queries match too
                if (cf != '★' && cf != '✪' && cg != cf && !multiWild) break;        // Break search, and thus no match is found
                if (cf == '✪') { multiWild = true; multiWildPosition = f; f++; continue; }              // Multi-char wildcard activated. Move query cursor, and reloop
                if (multiWild && cg != cf && cf != '★') { f = multiWildPosition + 1; g++; continue; }   // Match since MultiWild has failed, so return query cursor to MultiWild position
                f++; g++;                                                           // Reaching here means that a single character was matched, so move both query and text cursor along one
            }

            if (f == find[q].Length) {          // If true, query cursor has reached the end of the query, so a match has been found!!!
                sb.Append(replace[q]);          // Append replacement
                foundMatch = true;
                if (find[q][f - 1] == '✪') { i = text.Length; break; }      // If the MultiWild is the last char in the query, then the rest of the string is a match, and so close off
                i += g - 1;                                                 // Move text cursor along by the amount equivalent to its found match
                break;                                                      // Stop testing further queries at this position so only one replacement is appended
            }
        }
        if (!foundMatch) sb.Append(text[i]);    // If a match wasn't found at that point in the text, then just append the original character
    }
    return sb;
}
