通配符字符串搜索算法 [英] Wildcard String Search Algorithm

查看:78
本文介绍了通配符字符串搜索算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的程序中,我需要在一个很大的字符串(〜1 mb)中搜索一个相对小的子字符串(<1 kb).问题是字符串包含"a?c"意义上的简单通配符,这意味着我想搜索诸如"abc"或"apc"之类的字符串,...(我只对第一次出现的字符串感兴趣)./p>

直到现在,我都使用平凡的方法(此处为伪代码)

 算法搜索",输入:干草堆(字符串),针(字符串)for(i = 0,i< length(干草堆),++ i)if(!CompareMemory(haystack + i,needle,length(needle)))返回我返回-1;(未找到) 

如果第一个和第二个参数相同(也与通配符有关),则"CompareMemory"返回0,仅关于第三个参数给出的字节数.

我的问题是,现在是否有一种快速的算法(您不必给出它,但是如果这样做,我希望使用c ++,c或伪代码).我开始此处但是我认为大多数快速算法都不允许使用通配符(通过它们利用字符串的性质).

我希望问题的格式可以,因为我是新来的,请多谢!!

解决方案

在c个字母的字母字符串中,让S的长度为s,让T_1 ... T_k的平均长度为b.将搜索k个目标字符串中的每一个.(问题陈述没有提到给定字符串的多次搜索;我在下面提到它,因为在这种范例中我的程序运行良好.)

程序使用O(s + c)的时间和空间进行设置,并且(如果S和T_i是随机字符串)O(k * u * s/c)+ O(k * b + k * b *s/c ^ u)进行搜索的总时间,程序中u = 3,如图所示.对于更长的目标,应增加u并选择稀有且分隔广泛的关键字符.

在步骤1中,程序创建一个由s + TsizMax整数组成的数组L(在程序中,TsizMax =允许的目标长度),并将其用于下一个出现字符的c个位置列表,列表头位于H []和T []中的尾巴.这是O(s + c)的时空步长.

在步骤2中,程序反复读取和处理目标字符串.步骤2A选择u = 3个不同的非通配符(在当前目标中).如图所示,该程序仅使用前三个这样的字符.只需做一点点的工作,它就可以使用目标中最稀有的字符来提高性能.请注意,它不能处理少于三个这样的字符的目标.

行"L [T [r]] = L [g + i] = g + i;"在步骤2A中,将在L中设置一个具有适当delta偏移量的保护单元,以使步骤2G将在搜索结束时自动执行,而无需在搜索过程中进行任何额外的测试.T [r]为字符r索引列表的末尾单元格,因此单元格L [g + i]成为字符r的新的自引用列表末尾.(这项技术允许循环以最少的外部条件测试运行.)

第2B步将vars a,b,c设置为列表的顶部,并根据目标中所选关键字符之间的距离设置增量dab,dac和dbc.

步骤2C检查关键字符是否出现在S中.此步骤是必需的,因为否则将挂起步骤2E中的while循环.我们不希望在while循环中进行更多检查,因为它们是搜索的内部循环.

第2D步执行第2E到2i步,直到var c指向S的末尾为止,此时不可能再进行任何匹配了.

步骤2E由u = 3个while循环组成,即强制增量距离",即,爬网索引a,b,c彼此重叠,只要它们不兼容模式.while循环相当快,对于各种v,d,w,每个循环本质上都是(除去++ si工具)"while(v + d< w)v = L [v]".重复复制三个while循环几次可能会提高性能,并且不会改变最终结果.

在步骤2G中,我们知道u键字符匹配,因此我们使用通配符处理对目标与匹配点进行了完全比较.步骤2H报告比较结果.给出的程序还报告了本节中的不匹配项;删除生产中的内容.

步骤2I推进了所有关键字符索引,因为当前索引的字符都不能成为另一个匹配项的关键部分.

您可以运行该程序以查看一些操作计数统计信息.例如,输出

 目标5 =< de?ga>012345678901234567890123456789012345678901abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg@ 17,de?ga和de3ga比赛@ 24,de?ga和defg4不同@ 31,de?ga和defga比赛进展:'d'0 + 3'e'3 + 3'g'3 + 3 = 6 + 9 = 15 

显示步骤2G被输入3次(即,关键字符匹配3次);全面比较成功了两次;步骤2E,同时循环高级索引6次;步骤2:将索引提前9次;总共有15个改进,可以搜索42个字符的字符串作为de?ga目标.

 /* jiw$ Id:stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $回复:用于在长字符串中搜索短目标的概念代码,目标中可能包含通配符.用户可以输入任意数量的目标作为命令行参数.此代码有2个长字符串可用于测试;如果第一个第一个参数的字符为"1",使用了杰伊[42]字符串,否则,凯[321].例如,对于带有* hay = jay的测试,请使用以下命令./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee?ddee?f或* hay = kay,./stringsearch BC?e?ji?pa?j?av ?? j锻炼程序.版权所有2011 James Waldby.提供而没有保修根据GPL v3条款(如http://www.gnu.org/licenses/gpl.html)*/#include< stdlib.h>#include< stdio.h>#include< string.h>#include< limits.h>//================================================int main(int argc,char * argv []){char jay [] ="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";char kay [] ="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja""etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii""etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii""ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach""ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";char * hay =(argc> 1&& argv [1] [0] =='1')?杰伊:凯;枚举{chars = 1<< CHAR_BIT,TsizMax = 40,Lsiz = TsizMax + sizeof kay,L1,L2};int L [L2],H [chars],T [chars],g,k,par;//步骤1.制作数组L,H,T.对于(k = 0; k <字符; ++ k),H [k] = T [k] = L1;//初始化H和Tfor(g = 0; hay [g]; ++ g){//创建干草的链接字符列表.k =干草[g];//在同一循环中,可以计算字符频率.如果(T [k] == L1)H [k] = T [k] = g;T [k] = L [T [k]] = g;}//步骤2.读取和处理目标字符串.对于(par = 1; par< argc; ++ par){int alpha [3],at [3],a = g,b = g,c = g,da,dab,dbc,dac,i,j,r;char * targ = argv [par];枚举{wild ='?'};int sa = 0,sb = 0,sc = 0,ta = 0,tb = 0,tc = 0;printf("Target%d =<%s> \ n",par,targ);//步骤2A.选择以下3个非野生字符.//照原样,为a,b,c选择前3个非通配符.//而是可以选择3个最稀有的字符.对于(j = 0; j< 3; ++ j)alpha [j] = -j;对于(i = j = 0; targ [i]& j< 3; ++ i)if(targ [i]!= wild){r = alpha [j] = targ [i];如果(alpha [0] == alpha [1] || alpha [1] == alpha [2]||alpha [0] == alpha [2])继续;at [j] = i;L [T [r]] = L [g + i] = g + i;++ j;}如果(j!= 3){printf(目标字符太少\ n");继续;}//步骤2B.将a,b,c设置为列表顶部位置,并设置增量.da = at [0];a = H [alpha [0]];dab = at [1] -at [0];b = H [α[1]];dbc = at [2] -at [1];c = H [α[2]];dac = at [2] -at [0];//步骤2C.查看关键字符是否出现在干草堆中如果(a> = g || b> = g || c> = g){printf(某些字符不匹配\ n");继续;}对于(g = 0; hay [g]; ++ g)printf(%d",g%10);printf("\ n%s \ n",干草);//显示干草堆,为用户提供帮助//步骤2D.搜索比赛而(c< g){//步骤2E.强制增量距离而(a + dab< b){a = L [a];++ sa;}//复制这些而(b + dbc< c){b = L [b];++ sb;}//3条abc行而(a + dac> c){c = L [c];++ sc;}//随心所欲.而(a + dab< b){a = L [a];++ sa;}//复制这些而(b + dbc< c){b = L [b];++ sb;}//3条abc行而(a + dac> c){c = L [c];++ sc;}//随心所欲.//步骤2F.查看是否满足增量距离if(a + dab == b& b + dbc == c& c< c)g {//步骤2G.是的,所以我们有3个字母的比赛,需要测试整个比赛.r = a-da;对于(k = 0; targ [k]; ++ k)如果((hay [r + k]!= targ [k])&&(targ [k]!= wild))休息;printf("@%3d,%s和",r,targ);为(i = 0; targ [i]; ++ i)putchar(hay [r ++]);//步骤2H.报告匹配(如果找到)puts(targ [k]?"different":"match");//步骤2I.前进所有a,b,c,继续寻找a = L [a];++ ta;b = L [b];++ tb;c = L [c];++ tc;}}printf(前进:'%c'%d +%d'%c'%d +%d'%c'%d +%d =%d +%d =%d \ n",alpha [0],sa,ta,alpha [1],sb,tb,alpha [2],sc,tc,sa + sb + sc,ta + tb + tc,sa + sb + sc + ta + tb + tc);}返回0;} 

请注意,如果您比当前的首选答案更喜欢此答案,请取消标记该答案并标记为该答案.:)

In my program I need to search in a quite big string (~1 mb) for a relatively small substring (< 1 kb). The problem is the string contains simple wildcards in the sense of "a?c" which means I want to search for strings like "abc" or also "apc",... (I am only interested in the first occurence).

Until now I use the trivial approach (here in pseudocode)

algorithm "search", input: haystack(string), needle(string)

for(i = 0, i < length(haystack), ++i)
 if(!CompareMemory(haystack+i,needle,length(needle))
  return i;

return -1; (Not found)

Where "CompareMemory" returns 0 iff the first and second argument are identical (also concerning wildcards) only regarding the amount of bytes the third argument gives.

My question is now if there is a fast algorithm for this (you don't have to give it, but if you do I would prefer c++, c or pseudocode). I started here but I think most of the fast algorithms don't allow wildcards (by the way they exploit the nature of strings).

I hope the format of the question is ok because I am new here, thank you in advance!

解决方案

Among strings over an alphabet of c characters, let S have length s and let T_1 ... T_k have average length b. S will be searched for each of the k target strings. (The problem statement doesn't mention multiple searches of a given string; I mention it below because in that paradigm my program does well.)

The program uses O(s+c) time and space for setup, and (if S and the T_i are random strings) O(k*u*s/c) + O(k*b + k*b*s/c^u) total time for searching, with u=3 in program as shown. For longer targets, u should be increased, and rare, widely-separated key characters chosen.

In step 1, the program creates an array L of s+TsizMax integers (in program, TsizMax = allowed target length) and uses it for c lists of locations of next occurrences of characters, with list heads in H[] and tails in T[]. This is the O(s+c) time and space step.

In step 2, the program repeatedly reads and processes target strings. Step 2A chooses u = 3 different non-wild key characters (in current target). As shown, the program just uses the first three such characters; with a tiny bit more work, it could instead use the rarest characters in the target, to improve performance. Note, it doesn't cope with targets with fewer than three such characters.

The line "L[T[r]] = L[g+i] = g+i;" within Step 2A sets up a guard cell in L with proper delta offset so that Step 2G will automatically execute at end of search, without needing any extra testing during the search. T[r] indexes the tail cell of the list for character r, so cell L[g+i] becomes a new, self-referencing, end-of-list for character r. (This technique allows the loops to run with a minimum of extraneous condition testing.)

Step 2B sets vars a,b,c to head-of-list locations, and sets deltas dab, dac, and dbc corresponding to distances between the chosen key characters in target.

Step 2C checks if key characters appear in S. This step is necessary because otherwise a while loop in Step 2E will hang. We don't want more checks within those while loops because they are the inner loops of search.

Step 2D does steps 2E to 2i until var c points to after end of S, at which point it is impossible to make any more matches.

Step 2E consists of u = 3 while loops, that "enforce delta distances", that is, crawl indexes a,b,c along over each other as long as they are not pattern-compatible. The while loops are fairly fast, each being in essence (with ++si instrumentation removed) "while (v+d < w) v = L[v]" for various v, d, w. Replicating the three while loops a few times may increase performance a little and will not change net results.

In Step 2G, we know that the u key characters match, so we do a complete compare of target to match point, with wild-character handling. Step 2H reports result of compare. Program as given also reports non-matches in this section; remove that in production.

Step 2I advances all the key-character indexes, because none of the currently-indexed characters can be the key part of another match.

You can run the program to see a few operation-count statistics. For example, the output

Target 5=<de?ga>
012345678901234567890123456789012345678901
abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg

@  17, de?ga  and  de3ga  match
@  24, de?ga  and  defg4  differ
@  31, de?ga  and  defga  match
Advances:  'd' 0+3  'e' 3+3  'g' 3+3  = 6+9 = 15

shows that Step 2G was entered 3 times (ie, the key characters matched 3 times); the full compare succeeded twice; step 2E while loops advanced indexes 6 times; step 2I advanced indexes 9 times; there were 15 advances in all, to search the 42-character string for the de?ga target.

/* jiw 
$Id: stringsearch.c,v 1.2 2011/08/19 08:53:44 j-waldby Exp j-waldby $

Re: Concept-code for searching a long string for short targets,
    where targets may contain wildcard characters.

The user can enter any number of targets as command line parameters.
This code has 2 long strings available for testing; if the first
character of the first parameter is '1' the jay[42] string is used,
else kay[321].

Eg, for tests with *hay = jay use command like

   ./stringsearch 1e?g a?cd bc?e?g c?efg de?ga ddee? ddee?f

or with *hay = kay,

   ./stringsearch  bc?e?  jih?  pa?j  ?av??j

to exercise program.

Copyright 2011 James Waldby.  Offered without warranty
under GPL v3 terms as at http://www.gnu.org/licenses/gpl.html
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
//================================================
int main(int argc, char *argv[]) {
  char jay[]="abc1efgabc2efgabcde3gabcdefg4bcdefgabc5efg";
  char kay[]="ludehkhtdiokihtmaihitoia1htkjkkchajajavpajkihtijkhijhipaja"
    "etpajamhkajajacpajihiatokajavtoia2pkjpajjhiifakacpajjhiatkpajfojii"
    "etkajamhpajajakpajihiatoiakavtoia3pakpajjhiifakacpajjhkatvpajfojii"
    "ihiifojjjjhijpjkhtfdoiajadijpkoia4jihtfjavpapakjhiifjpajihiifkjach"
    "ihikfkjjjjhijpjkhtfdoiajakijptoik4jihtfjakpapajjkiifjpajkhiifajkch";
  char *hay = (argc>1 && argv[1][0]=='1')? jay:kay;

  enum { chars=1<<CHAR_BIT, TsizMax=40, Lsiz=TsizMax+sizeof kay, L1, L2 };
  int L[L2], H[chars], T[chars], g, k, par;

  // Step 1.  Make arrays L, H, T.
  for (k=0; k<chars; ++k) H[k] = T[k] = L1; // Init H and T
  for (g=0; hay[g]; ++g) {  // Make linked character lists for hay.
    k = hay[g];            // In same loop, could count char freqs.
    if (T[k]==L1) H[k] = T[k] = g;
    T[k] = L[T[k]] = g;
  }

  // Step 2.  Read and process target strings.
  for (par=1; par<argc; ++par) {
    int alpha[3], at[3], a=g, b=g, c=g, da, dab, dbc, dac, i, j, r;
    char * targ = argv[par];
    enum { wild = '?' };
    int sa=0, sb=0, sc=0, ta=0, tb=0, tc=0;
    printf ("Target %d=<%s>\n", par, targ);

    // Step 2A.  Choose 3 non-wild characters to follow.
    // As is, chooses first 3 non-wilds for a,b,c.
    // Could instead choose 3 rarest characters.
    for (j=0; j<3; ++j) alpha[j] = -j;
    for (i=j=0; targ[i] && j<3; ++i)
      if (targ[i] != wild) {
        r = alpha[j] = targ[i];
        if (alpha[0]==alpha[1] || alpha[1]==alpha[2]
            || alpha[0]==alpha[2]) continue;
        at[j] = i;
        L[T[r]] = L[g+i] = g+i;
        ++j;
      }
    if (j != 3) {
      printf ("  Too few target chars\n"); 
      continue;
    }

    // Step 2B.  Set a,b,c to head-of-list locations, set deltas.
    da  = at[0];
    a = H[alpha[0]];  dab = at[1]-at[0];
    b = H[alpha[1]];  dbc = at[2]-at[1];
    c = H[alpha[2]];  dac = at[2]-at[0];

    // Step 2C.  See if key characters appear in haystack
    if (a >= g || b >= g || c >= g) {
      printf ("  No match on some character\n");
      continue;      
    }

    for (g=0; hay[g]; ++g) printf ("%d", g%10);
    printf ("\n%s\n", hay);    // Show haystack, for user aid

    // Step 2D.  Search for match
    while (c < g) {
      // Step 2E.  Enforce delta distances
      while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
      while (b+dbc < c) {b = L[b]; ++sb; } //  3 abc lines as many 
      while (a+dac > c) {c = L[c]; ++sc; } //   times as you like.
      while (a+dab < b) {a = L[a]; ++sa; } // Replicate these
      while (b+dbc < c) {b = L[b]; ++sb; } //  3 abc lines as many 
      while (a+dac > c) {c = L[c]; ++sc; } //   times as you like.

      // Step 2F.  See if delta distances were met
      if (a+dab==b && b+dbc==c && c<g) {
        // Step 2G.  Yes, so we have 3-letter-match and need to test whole match.
        r = a-da;
        for (k=0; targ[k]; ++k)
          if ((hay[r+k] != targ[k]) && (targ[k] != wild))
            break;
        printf ("@ %3d, %s  and  ", r, targ);
        for (i=0; targ[i]; ++i) putchar(hay[r++]);
        // Step 2H.  Report match, if found
        puts (targ[k]? "  differ" : "  match");
        // Step 2I.  Advance all of a,b,c, to go on looking
        a = L[a]; ++ta;
        b = L[b]; ++tb;
        c = L[c]; ++tc;
      }
    }
    printf ("Advances:  '%c' %d+%d  '%c' %d+%d  '%c' %d+%d  = %d+%d = %d\n",
        alpha[0], sa,ta, alpha[1], sb,tb, alpha[2], sc,tc,
        sa+sb+sc, ta+tb+tc, sa+sb+sc+ta+tb+tc);
  }
  return 0;
}

Note, if you like this answer better than current preferred answer, unmark that one and mark this one. :)

这篇关于通配符字符串搜索算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆