是否有一个博耶 - 穆尔字符串搜索和快速搜索和替换功能,快速串计数2010年德尔福字符串(统一codeString的)在那里? [英] Is there a Boyer-Moore string search and fast search and replace function and fast string count for Delphi 2010 String (UnicodeString) out there?

查看:222
本文介绍了是否有一个博耶 - 穆尔字符串搜索和快速搜索和替换功能,快速串计数2010年德尔福字符串(统一codeString的)在那里?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要三快上大串功能:快速搜索,快速搜索和替换,以及快速计数的子字符串中的

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

我遇到了C语言博耶 - 穆尔字符串搜索++和Python的,而是用来实现快速搜索和替换,我发现的唯一德尔福博耶 - 穆尔算法是原DroopyEyes软件FastStrings由彼得·莫里斯的一部分,和他的网站和电子邮件都不再工作。

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.

我已经移植 FastStrings 的期待在Delphi 2009/2010,其中一个字节等于为AnsiStrings工作的伟大1 ANSIChar类型,而且使它们还与德尔福2010年字符串(统一codeString的)工作似乎不那么容易。

I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.

使用该博耶 - 穆尔算法,它应该可以很容易地做到不区分大小写的搜索,以及不区分大小写的搜索和替换,没有任何临时的字符串(使用StrUpper等),也没有调用波什(),这是比较慢当在被要求相同的文字重复比搜索博耶 - 穆尔搜索。

Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.

(编辑:我已经解决部分问题写成一个回答这个问题,这几乎是100%完成,它甚至有一个快速的字符串替换功能,我相信它一定缺陷,特别是认为既然$。 p $ ptends是统一code能力,它必须是有因未履行统一code承诺毛刺。)

( I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )

(EDIT2:有趣的和意想不到的结果;一单code code点表的大堆栈大小在堆栈上 - 在code以下SkipTable提出了严重的阻尼器上以量取胜-win优化,你可以在一个单code字符串博耶 - 穆尔字符串搜索在这里做的。感谢弗洛朗Ouchet您指出我应该立刻注意到了。)

( Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)

推荐答案

这个答案现在已经完成,只是code不是很好(单位)进行测试,并且很可能得到进一步优化,比如我反复地方功能__SameChar而是采用了比较函数回调这将是更快,实际上,允许用户传递一个比较函数所有这些将是伟大的谁希望提供一些额外的逻辑统一code用户(相当于套统一的code字形的一些语言)。然而,我对不区分大小写的波什一个博耶 - 穆尔的解决方案,以及更快的子串计数,而现在搜索和替换过:

This answer is now complete except that the code is not well (Unit) tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages). I do however have a Boyer-Moore based solution for case-insensitive Pos, and faster substring counting, and now search and replace too:

,我建立了下面。

{ _FindStringBoyer:
  Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := 0;
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found, return.
        //Result := Index - kIndex + 1;
        Index := Index + LengthPattern;
        fromPos := Index;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end; 

这确实是一个不错的算法,这是因为:

This is really a nice Algorithm, because:

  • 在它的方式更快来算子的X实例的序列Y这种方式,辉煌左右。
  • 对于仅仅更换波什()的_FindStringBoyer()比名次()的纯ASM版本快促成德尔福通过快速code项目的人来说,这是目前用于POS机,如果你需要的个案不敏感,你能想象的性能提升时,我们不必调用大写一个100兆字节的字符串。 (好吧,你的字符串不会那么大。但尽管如此,高效的算法是一个美丽的事物。)

好吧,我写一个字符串替换博耶,摩尔风格:

Okay I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := '';
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := rfIgnoreCase in Flags;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern < Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;

好了,问题:这个堆栈的足迹:

Okay, problem: Stack footprint of this:

var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

再见CPU地狱,你好栈地狱。如果我去一个动态数组,然后我在运行时调整其大小。所以这个事情基本上是快速的,因为你的计算机上的虚拟内存系统不眨眼在256K打算在栈上,但是这并不总是code的最佳作品。不过我的电脑不眨眼,在这样的大垛的东西。它不会成为一个Delphi标准库的默认或将来赢得任何快速code的挑战,这有点儿足迹。我认为,反复搜索是其中上述code应写为一类的情况下,和skiptable应该是在某一类的数据字段。然后,你可以建立博耶 - 穆尔表一次,随着时间的推移,如果字符串是不变的,可重复使用该对象做快速查找。

Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

这篇关于是否有一个博耶 - 穆尔字符串搜索和快速搜索和替换功能,快速串计数2010年德尔福字符串(统一codeString的)在那里?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆