是否有用于 Delphi 2010 字符串 (UnicodeString) 的 Boyer-Moore 字符串搜索和快速搜索和替换功能以及快速字符串计数? [英] Is there a Boyer-Moore string search and fast search and replace function and fast string count for Delphi 2010 String (UnicodeString) out there?

查看:42
本文介绍了是否有用于 Delphi 2010 字符串 (UnicodeString) 的 Boyer-Moore 字符串搜索和快速搜索和替换功能以及快速字符串计数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要三个快速处理大字符串的函数:快速搜索、快速搜索和替换以及快速计算字符串中的子字符串.

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

我在 C++ 和 Python 中遇到过 Boyer-Moore 字符串搜索,但我发现的唯一用于实现快速搜索和替换的 Delphi Boyer-Moore 算法是 Peter Morris 的 FastStrings 的一部分,以前是 DroopyEyes 软件,他的网站和电子邮件不再有效.

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.

我已经移植了 FastStrings 以便在 Delphi 2009/2010 中为 AnsiStrings 工作得很好,其中一个字节等于一个 AnsiChar,但使它们也与 Delphi 2010 中的字符串 (UnicodeString) 一起使用似乎很重要.

I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.

使用这种 Boyer-Moore 算法,应该可以轻松进行不区分大小写的搜索,以及不区分大小写的搜索和替换,无需任何临时字符串(使用 StrUpper 等),并且无需调用较慢的 Pos()当需要对同一文本进行重复搜索时,比 Boyer-Moore 搜索要好.

Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.

(我有一个部分解决方案,作为这个问题的答案,它几乎 100% 完成,它甚至有一个快速的字符串替换功能.我相信它一定有错误,特别是因为它假装要成为 Unicode 能力,一定是由于未履行 Unicode 承诺而出现故障.)

( I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )

(Edit2:有趣且出乎意料的结果;堆栈上 unicode 代码点表的大堆栈大小 - 下面代码中的 SkipTable 严重阻碍了您可以在此处进行的双赢优化的数量unicode string boyer-moore 字符串搜索.感谢 Florent Ouchet 指出我应该立即注意到的内容.)

( Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)

推荐答案

这个答案现在已经完成并且适用于区分大小写的模式,但不适用于不区分大小写的模式,并且可能还有其他错误,因为它不是很好的单元经过测试,可能会进一步优化,例如我重复了本地函数 __SameChar 而不是使用比较函数回调,这本来会更快,实际上,允许用户为所有这些传递比较函数对 Unicode 来说非常棒想要提供一些额外逻辑的用户(某些语言的等效 Unicode 字形集).

This answer is now complete and works for case sensitive mode, but does not work for case insensitive mode, and probably has other bugs too, since it's not well unit tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages).

基于 Dorin Dominica 的代码,我构建了以下内容.

Based on Dorin Dominica's code, I built the following.

{ _FindStringBoyer:
  Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := 0;
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found, return.
        //Result := Index - kIndex + 1;
        Index := Index + LengthPattern;
        fromPos := Index;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end; 

这真是一个不错的算法,因为:

This is really a nice Algorithm, because:

  • 以这种方式计算字符串 Y 中子字符串 X 的实例要快得多,非常棒.
  • 仅仅替换 Pos(),_FindStringBoyer() 比 FastCode 项目人员贡献给 Delphi 的 Pos() 的纯 asm 版本更快,目前用于 Pos,如果你需要不区分大小写,你可以想象一下当我们不必在 100 兆字节的字符串上调用 UpperCase 时的性能提升.(好吧,你的字符串不会那么大.但是,高效的算法仍然是一种美.)

好吧,我写了一个 Boyer-Moore 风格的字符串替换:

Okay I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := '';
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := rfIgnoreCase in Flags;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern < Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;

好的,问题:这个的堆栈占用:

Okay, problem: Stack footprint of this:

var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

再见 CPU 地狱,你好堆栈地狱.如果我使用动态数组,那么我必须在运行时调整它的大小.所以这个东西基本上很快,因为你计算机上的虚拟内存系统不会在堆栈上的 256K 时闪烁,但这并不总是最佳的代码段.尽管如此,我的电脑不会在像这样的大堆栈东西上闪烁.它不会成为 Delphi 标准库的默认设置,也不会在未来赢得任何 fastcode 挑战,因为它有点占用空间.我认为重复搜索是上面代码应该写成一个类的情况,而skiptable应该是那个类中的一个数据字段.然后您可以构建一次boyer-moore 表,随着时间的推移,如果字符串不变,则重复使用该对象进行快速查找.

Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

这篇关于是否有用于 Delphi 2010 字符串 (UnicodeString) 的 Boyer-Moore 字符串搜索和快速搜索和替换功能以及快速字符串计数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆