快速字符串搜索以找到与给定模式匹配的字符串数组元素 [英] fast string search to find string array element which matches the givven pattern

查看:108
本文介绍了快速字符串搜索以找到与给定模式匹配的字符串数组元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个常量字符串数组,可以迭代该常量字符串以查找元素的索引,该索引是一个包含搜索模式的字符串.我应该选择哪种搜索算法来提高找到该元素的速度?在运行应用程序以准备查找表之前,我没有时间限制.

I have an array of constant strings which I iterate through to find an index of element which is a string that contains a search pattern. Which search algorithm should I choose to improve the speed of finding this element? I am not limited in time before running the application for preparing the look up tables if any are necessary.

我更正了一个问题-我没有在进行精确的字符串匹配-我正在搜索位于数组中的元素内部的模式

I corrected a question - I am not doing exact string match - I am searching for pattern inside the element, which is in an array

array of strings:
[0] Red fox jumps over the fence
[1] Blue table
[2] Red flowers on the fence

我需要找到一个包含单词'table'的元素-在这种情况下,它是元素1

I need to find an element which contains word 'table' - in this case its element 1

我喜欢一组30个数组的50000次迭代,该数组最多可以包含30000个不少于128个字符的字符串.现在我正在使用过时的strstr蛮力,太慢了...

I do like 50000 iterations of a set of 30 array which could contain up to 30000 strings of not less than 128 characters. Now I am using good-old strstr brute force which is too slow...

好吧,发布我函数的一部分,第一个strstr-如果出现任何未删减的行,则在未切割的行数组中查找,然后进行残酷搜索.我知道我可以加快这部分的速度,但是我并未对此方法进行优化...

Ok, posting a part of my function, the first strstr - looks up in an uncut array of lines if there are any occurrences, then the brute search follows. I know I can speed this part, but I am not doing optimization on this approach...

// sheets[i].buffer - contains a page of a text which is split into lines
// fullfunccall - is the pattern
// sheets[i].cells[k] - are the pointers to lines in a buffer

for( i=0; i<sheetcount; i++) {
  if( i!= currsheet && sheets[i].name && sheets[i].name[0] != '\0') {
    if( strstr(sheets[i].buffer, fullfunccall )) {
      usedexternally = 1;

      int foundatleastone = 0;
      for( k=0; k<sheets[i].numcells; k++ ) {
        strncpy_s(testline, MAX_LINE_SIZE, sheets[i].cells[k]->line, sheets[i].cells[k]->linesize);
        testline[sheets[i].cells[k]->linesize] = '\0';

        if( strstr(testline, fullfunccall )) {
          dependency_num++;

          if( dependency_num >= MAX_CELL_DEPENDENCIES-1) {
            printf("allocation for sheet cell dependencies is insuficcient\n");
            return;
          }

          sheets[currsheet].cells[currcellposinsheet]->numdeps = dependency_num+1;
          foundatleastone++;
          sheets[currsheet].cells[currcellposinsheet]->deps[dependency_num] = &sheets[i].cells[k];
        }
      }

      if( foundatleastone == 0 ) {
        printf("error locating dependency for external func: %s\n", fullfunccall );
        return;
      }
    }
  };
}

推荐答案

对于要处理的每个工作表,您都可以构建后缀数组,如

For each sheet that you are treating, you could build a suffix array as described in this article. Before you start your search, read the sheet, find the line beginnings (as integer indices into the sheet buffer), create the suffix array and sort it as described in the article.

现在,如果您要查找出现模式(例如表格")的行,则可以搜索表格"之后的下一个条目和"tablf"之后的下一个条目,这是第一个非匹配,您已将最右边的字母移动到里程表样式.

Now, if you are looking for the lines in which a pattern, say "table", occurs, you can search for the next entry after "table" and the next entry after "tablf", which is the first non-match, where you have moved the right-most letter, odometer-style.

如果两个索引相同,则没有匹配项.如果它们不同,您将在工作表中获得一个指针列表:

If both indices are the same, there are no matches. If they are different, you'll get a list of pointers into the sheet:

"tab. And now ..."
----------------------------------------------------------------
"table and ..."                0x0100ab30
"table water for ..."          0x0100132b
"tablet computer ..."          0x01000208
----------------------------------------------------------------
"tabloid reporter ..."

这将为您提供一个指针列表,通过减去工作表缓冲区的基指针,您可以获得整数偏移量.与行首的比较将为您提供与这些指针相对应的行号. (行号已排序,因此您可以在此处进行二进制搜索.)

This will give you a list of pointers from which, by subtracting the base pointer of the sheet buffer, you can get the integer offsets. Comparison with the line beginnings will give you the line numbers that correspond to these pointers. (The line numbers are sorted, so you can do binary search here.)

内存开销是一个指针数组,其大小与工作表缓冲区相同,因此对于30,000个200个字符的字符串,在64位计算机上约为48MB. (行索引的开销可以忽略不计.)

The memory overhead is an array of pointers that has the same size as the sheet buffer, so for 30,000 strings of 200 chars, that will be about 48MB on a 64-bit machine. (The overhead of the line indices is negligible.)

对数组进行排序会花费很长时间,但是每张纸只能进行一次.

Sorting the array will take long, but it is done only once for each sheet.

编辑:这个想法似乎很好.我已经实现了它,并且可以在文本文件中扫描大约130,000个单词的字典在不到一秒钟的时间内达到600k :

Edit: The idea seems to work well. I have implemented it and can scan a dictionary of about 130,000 words on a text file of nearly 600k in less then one second:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define die(...) exit((fprintf(stderr, "Fatal: " __VA_ARGS__), \
    putc(10, stderr), 1))



typedef struct Sheet Sheet;    

struct Sheet {
    size_t size;    /* Number of chars */
    char *buf;      /* Null-terminated char buffer */
    char **ptr;     /* Pointers into char buffer */
    size_t nline;   /* number of lines */
    int *line;      /* array of offset of line beginnings */
    size_t naux;    /* size of scratch array */
    char **aux;     /* scratch array */
};


/*
 *      Count occurrence of c in zero-terminated string p.
 */
size_t strcount(const char *p, int c)
{
    size_t n = 0;

    for (;;) {
        p = strchr(p, c);
        if (p == NULL) return n;
        p++;
        n++;        
    }

    return 0;
}

/*
 *      String comparison via pointers to strings.
 */
int pstrcmp(const void *a, const void *b)
{
    const char *const *aa = a;
    const char *const *bb = b;

    return strcmp(*aa, *bb);
}

/*
 *      Pointer comparison.
 */
int ptrcmp(const void *a, const void *b)
{
    const char *const *aa = a;
    const char *const *bb = b;

    if (*aa == *bb) return 0;   
    return (*aa < *bb) ? -1 : 1;
}

/*
 *      Create and prepare a sheet, i.e. a text file to search.
 */
Sheet *sheet_new(const char *fn)
{
    Sheet *sheet;
    FILE *f = fopen(fn, "r");
    size_t n;
    int last;
    char *p;
    char **pp;

    if (f == NULL) die("Couldn't open %s", fn);

    sheet = malloc(sizeof(*sheet));
    if (sheet == NULL) die("Allocation failed");

    fseek(f, 0, SEEK_END);
    sheet->size = ftell(f);
    fseek(f, 0, SEEK_SET);

    sheet->buf = malloc(sheet->size + 1);
    sheet->ptr = malloc(sheet->size * sizeof(*sheet->ptr));

    if (sheet->buf == NULL) die("Allocation failed");
    if (sheet->ptr == NULL) die("Allocation failed");

    fread(sheet->buf, 1, sheet->size, f);
    sheet->buf[sheet->size] = '\0';
    fclose(f);

    sheet->nline = strcount(sheet->buf, '\n');
    sheet->line = malloc(sheet->nline * sizeof(*sheet->line));

    sheet->aux = NULL;
    sheet->naux = 0;

    n = 0;
    last = 0;
    p = sheet->buf;
    pp = sheet->ptr;
    while (*p) {
        *pp++ = p;
        if (*p == '\n') {
            sheet->line[n++] = last;
            last = p - sheet->buf + 1;
        }
        p++;
    }

    qsort(sheet->ptr, sheet->size, sizeof(*sheet->ptr), pstrcmp);

    return sheet;
}

/*
 *      Clean up sheet.
 */
void sheet_delete(Sheet *sheet)
{
    free(sheet->buf);
    free(sheet->ptr);
    free(sheet->line);
    free(sheet->aux);
    free(sheet);
}

/*
 *      Binary range search for string pointers.
 */
static char **pstr_bsearch(const char *key,
    char **arr, size_t high)
{
    size_t low = 0;

    while (low < high) {
        size_t mid = (low + high) / 2;
        int diff = strcmp(key, arr[mid]);

        if (diff < 0) high = mid;
        else low = mid + 1;
    }

    return arr + low;
}

/*
 *      Binary range search for line offsets.
 */
static const int *int_bsearch(int key, const int *arr, size_t high)
{
    size_t low = 0;

    while (low < high) {
        size_t mid = (low + high) / 2;
        int diff = key - arr[mid];

        if (diff < 0) high = mid;
        else low = mid + 1;
    }

    if (low < 1) return NULL;
    return arr + low - 1;
}

/*
 *      Find occurrences of the string key in the sheet. Returns the
 *      number of lines in which the key occurs and assigns up to
 *      max lines to the line array. (If max is 0, line may be NULL.)
 */
int sheet_find(Sheet *sheet, char *key,
    int line[], int max)
{
    char **begin, **end;
    int n = 0;
    size_t i, m;
    size_t last;

    begin = pstr_bsearch(key, sheet->ptr, sheet->size);
    if (begin == NULL) return 0;

    key[strlen(key) - 1]++;
    end = pstr_bsearch(key, sheet->ptr, sheet->size);
    key[strlen(key) - 1]--;
    if (end == NULL) return 0;
    if (end == begin) return 0;

    m = end - begin;
    if (m > sheet->naux) {
        if (sheet->naux == 0) sheet->naux = 0x100;
        while (sheet->naux < m) sheet->naux *= 2;
        sheet->aux = realloc(sheet->aux, sheet->naux * sizeof(*sheet->aux));
        if (sheet->aux == NULL) die("Re-allocation failed");        
    }

    memcpy(sheet->aux, begin, m * sizeof(*begin));
    qsort(sheet->aux, m, sizeof(*begin), ptrcmp);

    last = 0;
    for (i = 0; i < m; i++) {
        int offset = sheet->aux[i] - sheet->buf;
        const int *p;

        p = int_bsearch(offset, sheet->line + last, sheet->nline - last);

        if (p) {
            if (n < max) line[n] = p - sheet->line;
            last = p - sheet->line + 1;
            n++;
        }
    }

    return n;
}

/*
 *      Example client code
 */
int main(int argc, char **argv)
{
    Sheet *sheet;
    FILE *f;

    if (argc != 3) die("Usage: %s patterns corpus", *argv);

    sheet = sheet_new(argv[2]);

    f = fopen(argv[1], "r");
    if (f == NULL) die("Can't open %s.", argv[1]);
    for (;;) {
        char str[80];
        int line[50];
        int i, n;

        if (fgets(str, sizeof(str), f) == NULL) break;
        strtok(str, "\n");
        n = sheet_find(sheet, str, line, 50);
        printf("%8d %s\n", n, str);

        if (n > 50) n = 50;
        for (i = 0; i < n; i++) printf("    [%d] %d\n", i, line[i] + 1);
    }
    fclose(f);

    sheet_delete(sheet);

    return 0;
}

该实现有其粗糙的边缘,但是可以工作.我不特别喜欢暂存数组和在找到的指针范围上进行的其他排序,但是事实证明,即使对大型后缀数组进行排序也不会花费太长时间.

The implementation has its rough edges, but it works. I'm not especially fond of the scratch array and the additional sorting on the found pointer range, but it turns out that even sorting the large suffix array doesn't take too long.

如果愿意,可以将此解决方案扩展到更多工作表.

You can extend this solution to more sheets, if you like.

这篇关于快速字符串搜索以找到与给定模式匹配的字符串数组元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆