编辑距离矩阵 [英] Edit Distance Matrix

查看:99
本文介绍了编辑距离矩阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试构建一个程序,该程序使用两个字符串并为其填充编辑距离矩阵。让我烦恼的是,对于第二个字符串输入,它跳过了第二个输入。我曾尝试使用getch()清除缓冲区,但没有成功。我也尝试过切换到scanf(),但这也导致了某些崩溃。

I'm trying to build a program that takes two strings and fills in the edit distance matrix for them. The thing that is tripping me up is, for the second string input, it is skipping over the second input. I've tried clearing the buffer with getch(), but it didn't work. I've also tried switching over to scanf(), but that resulted in some crashes as well. Help please!

代码:

#include <stdio.h>
#include <stdlib.h>

int min(int a, int b, int c){
    if(a > b && a > c)
        return a;
    else if(b > a && b > c)
        return b;
    else
        return c;
}

int main(){

    // allocate size for strings
    int i, j;
    char *input1 = (char*)malloc(sizeof(char)*100);
    char *input2 = (char*)malloc(sizeof(char)*100);

    // ask for input
    printf("Enter the first string: ");
    fgets(input1, sizeof(input1), stdin);
    printf("\nEnter the second string: ");
    fgets(input2, sizeof(input2), stdin);

    // make matrix
    int len1 = sizeof(input1), len2 = sizeof(input2);
    int c[len1 + 1][len2 + 1];

    // set up input 2 length
    for(i = 0; i < len2 + 1; i++){
        c[0][i] = i;
    }

    // set up input 1 length
    for(i = 0; i < len1 + 1; i++){
        c[i][0] = i;
    }

    // fill in the rest of the matrix
    for(i = 1; i < len1; i++){
        for(j = 1; j < len2; j++){
            if(input1[i] == input2[j]) // if the first letters are equal make the diagonal equal to the last
                c[i][j] = c[i - 1][j - 1];
            else
                c[i][j] = 1 + min(c[i - 1][j - 1], c[i - 1][j], c[i][j - 1]);
        }
    }

    // print the matrix
    printf("\n");
    for(j = 0; j < len2; j++){
        for(i = 0; i < len1; i++){
            printf("|  %d", c[i][j]);
        }
        printf("\n");
    }

    return 1;
}


推荐答案

用<$ c $粘贴c> fgets 。

正如其他人指出的那样,使用 char input1 [100] 而不是 char * input1 = malloc(...)

As others have pointed out, use char input1[100] instead of char *input1 = malloc(...)

但是,即使进行了更改,这也使得内的 sizeof 正确,使用 sizeof 时设置 len1 len2 是错误的。即使它们中只有10个有效字符(即,其余字符是不确定的/随机的),您也将处理100个整个缓冲区。

But, even with that change, which makes the sizeof inside of the fgets correct, using sizeof when setting up len1 and len2 is wrong. You'll be processing an entire buffer of 100, even if their are only 10 valid characters in it (i.e. the remaining ones are undefined/random).

您[可能]想要的是 strlen [和换行符]以获取实际的有效长度。

What you [probably] want is strlen [and a newline strip] to get the actual useful lengths.

这是修改后的代码[请原谅免费的样式清理]:

Here's the modified code [please pardon the gratuitous style cleanup]:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
min(int a, int b, int c)
{
    if (a > b && a > c)
        return a;
    if (b > a && b > c)
        return b;
    return c;
}

int
main(void)
{

    // allocate size for strings
    int i;
    int j;
    char input1[100];
    char input2[100];

    // ask for input
    printf("Enter the first string: ");
    fgets(input1, sizeof(input1), stdin);
    int len1 = strlen(input1);
    if (input1[len1 - 1] == '\n') {
        input1[len1 - 1] = 0;
        --len1;
    }

    printf("\nEnter the second string: ");
    fgets(input2, sizeof(input2), stdin);
    int len2 = strlen(input2);
    if (input2[len2 - 1] == '\n') {
        input2[len2 - 1] = 0;
        --len2;
    }

    // make matrix
    int c[len1 + 1][len2 + 1];

    // set up input 2 length
    for (i = 0; i < len2 + 1; i++) {
        c[0][i] = i;
    }

    // set up input 1 length
    for (i = 0; i < len1 + 1; i++) {
        c[i][0] = i;
    }

    // fill in the rest of the matrix
    for (i = 1; i < len1; i++) {
        for (j = 1; j < len2; j++) {
            // if the 1st letters are equal make the diagonal equal to the last
            if (input1[i] == input2[j])
                c[i][j] = c[i - 1][j - 1];
            else
                c[i][j] = 1 + min(c[i - 1][j - 1], c[i - 1][j], c[i][j - 1]);
        }
    }

    // print the matrix
    printf("\n");
    for (j = 0; j < len2; j++) {
        for (i = 0; i < len1; i++) {
            printf("|  %d", c[i][j]);
        }
        printf("\n");
    }

    return 1;
}






更新:


好吧,我明白你的意思了!我尝试使用malloc的原因是为了避免生成必须打印100x100空白空间的矩阵。

Okay sweet I see what you mean! The reason I was trying to use malloc though was to avoid making the matrix that I had to print a size of 100x100 blank spaces.

如果使用固定大小的 input1 malloc ed一个,则 fgets 仅将其填充为输入的输入大小(必要时剪切至第二个参数)。但是,它不会任何内容(例如,右边的空格)填充/填充缓冲区的其余部分。 要做的是在从输入中读取的最后一个字符[通常是换行符]之后添加一个EOS [字符串结尾]字符[它是二进制0x00]。

With either the fixed size input1 or the malloced one, fgets will only fill it to the input size entered [clipped to the second argument, if necessary]. But, it does not pad/fill the remainder of the buffer with anything (e.g. spaces on the right). What it does do is add an EOS [end-of-string] character [which is a binary 0x00] after the last char read from input [which is usually the newline].

因此,如果输入字符串为: abcdef\n ,则其长度[可从 strlen获得]为7,input [7]将为0x00,而 input1 [8] input1 [99] 将具有未定义/随机/不可预测的值和不是空格。

Thus, if the input string is: abcdef\n, the length [obtainable from strlen] is 7, input[7] will be 0x00, and input1[8] through input1[99] will have undefined/random/unpredictable values and not spaces.

由于换行符不是非常有用,因此通常剥离后再进一步处理。例如,在计算一个小词组的编辑距离时,这并不是非常相关。

Since a newline char isn't terribly useful, it is often stripped out before further processing. For example, it isn't terribly relevant when computing edit distance for a small phrase.


使用strlen()只会计算数组中是否包含char,还是它也包含所有空格?

Does using strlen() only count the number of chars inside the array, or does it include all the blank spaces too?

如上所述, fgets 不会 end 处填充字符串,因此,不必担心。

As I mentioned above, fgets does not pad the string at the end, so, not to worry. It will do what you want/expect.

strlen 只会计算不超过[但 ,包括EOS终止符(即零)。如果其中一些字符恰好是空格,则 会被 strlen 计数-这就是我们想要的。

strlen only counts chars up to [but not including the EOS terminator character (i.e.) zero]. If some of these chars happen to be spaces, they will be counted by strlen--which is what we want.

考虑计算以下两个短语之间的编辑距离:

Consider computing the edit distance between any two of the following phrases:

quick brown fox jumped over the lazy dogs
the quick brown fox jumped over lazy dogs
quick brown fox jumps over the lazy dog

在每种情况下,我们想要 strlen 包含[内部/嵌入式]空格在长度计算中。这是因为 完全有效地计算短语的编辑距离。

In each case, we want strlen to include the [internal/embedded] spaces in the length calculation. That's because it is perfectly valid to compute the edit distance of phrases.

malloc 的有效用法:当数据量太大而无法容纳在堆栈中时。大多数系统都有默认限制(例如,在Linux下为8 MB)。

There is a valid usage for malloc: when the amount of data is too big to fit on the stack. Most systems have a default limit (e.g. under linux, it's 8 MB).

假设我们正在计算[从文件读取]的两本书章节的编辑距离, d有(例如):

Suppose we were computing the edit distance for two book chapters [read from files], we'd have (e.g.):

char input1[50000];
char input2[50000];

上面的内容适合,但 c 矩阵会导致堆栈溢出:

The above would fit, but the c matrix would cause a stack overflow:

int c[50000][50000];

因为它的大小为 50000 * 50000 * 4 大约9.3 GB。

because the size of this would be 50000 * 50000 * 4 which is approx 9.3 GB.

因此,要容纳所有这些数据,我们需要在堆上分配它。尽管可以为 c 进行 malloc 并保持2D矩阵访问,我们必须创建一个函数并传递指向 c 的指针。

So, to fit all this data, we'd need to allocate it on the heap. While it is possible to do a malloc for c and maintain the 2D matrix access, we'd have to create a function and pass off the pointer to c to it.

所以,这是一个修改后的版本需要输入任意大尺寸:

So, here's a modified version that takes input of arbitrarily large sizes:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

#define sysfault(_fmt...) \
    do { \
        fprintf(stderr,_fmt); \
        exit(1); \
    } while (0)

#define C(y,x)      c[((y) * (len2 + 1)) + (x)]

long
min(long a, long b, long c)
{
    if (a > b && a > c)
        return a;
    if (b > a && b > c)
        return b;
    return c;
}

char *
input(const char *prompt,long *lenp,const char *file)
{
    FILE *fp;
    char *lhs;
    int chr;
    long siz;
    long len;

    if (file != NULL)
        fp = fopen(file,"r");
    else {
        fp = stdin;
        printf("Enter %s string: ",prompt);
        fflush(stdout);
    }

    lhs = NULL;
    siz = 0;
    len = 0;

    while (1) {
        chr = fgetc(fp);
        if (chr == EOF)
            break;

        if ((chr == '\n') && (file == NULL))
            break;

        // grow the character array
        if ((len + 1) >= siz) {
            siz += 100;
            lhs = realloc(lhs,siz);
            if (lhs == NULL)
                sysfault("input: realloc failure -- %s\n",strerror(errno));
        }

        lhs[len] = chr;
        len += 1;
    }

    if (file != NULL)
        fclose(fp);

    if (lhs == NULL)
        sysfault("input: premature EOF\n");

    // add the EOS
    lhs[len] = 0;

    // return the length to the caller
    *lenp = len;

    return lhs;
}

int
main(int argc,char **argv)
{
    long i;
    long j;
    char *input1;
    long len1;
    char *input2;
    long len2;
    long *c;

    --argc;
    ++argv;

    switch (argc) {
    case 2:
        input1 = input("first",&len1,argv[0]);
        input2 = input("second",&len2,argv[1]);
        break;
    default:
        input1 = input("first",&len1,NULL);
        input2 = input("second",&len2,NULL);
        break;
    }

    // make matrix
    c = malloc(sizeof(*c) * (len1 + 1) * (len2 + 1));
    if (c == NULL)
        sysfault("main: malloc failure -- %s\n",strerror(errno));

    // set up input 2 length
    for (i = 0; i < len2 + 1; i++) {
        C(0,i) = i;
    }

    // set up input 1 length
    for (i = 0; i < len1 + 1; i++) {
        C(i,0) = i;
    }

    // fill in the rest of the matrix
    for (i = 1; i < len1; i++) {
        for (j = 1; j < len2; j++) {
            // if the 1st letters are equal make the diagonal equal to the last
            if (input1[i] == input2[j])
                C(i,j) = C(i - 1,j - 1);
            else
                C(i,j) = 1 + min(C(i - 1,j - 1), C(i - 1,j), C(i,j - 1));
        }
    }

    // print the matrix
    printf("\n");
    for (j = 0; j < len2; j++) {
        for (i = 0; i < len1; i++) {
            printf("|  %ld", C(i,j));
        }
        printf("\n");
    }

    return 1;
}

这篇关于编辑距离矩阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆