Why does my Levenshtein distance calculator fail with PDF files?


Question

I'm trying to create a program that calculates the edit distance between two files. I read the files with the function fread, opening them in binary mode ("rb"). I gave it two PDF files as input, and while debugging I found that when I try to fill the matrix of the Levenshtein distance algorithm I get a "SIGSEGV (Segmentation fault)" at character number 1354 of the first file, and the program exits with:

Process finished with exit code -1073741819 (0xC0000005)

I checked, and character number 1354 is \n.

The code that I use to read the files is:

long getFileSize(FILE *file) {
    long int size;
    fseek(file, 0, SEEK_END);
    size = ftell(file);
    fseek(file, 0, SEEK_SET);
    return size;
}

char *readFromBinary(char *path) {
    FILE *file;
    file = fopen(path, "rb");
    if (file == NULL)
        printf("Error!\n");

    long fileSize = getFileSize(file);
    char *buffer = malloc((fileSize + 1) * sizeof(char));

    fread(buffer, sizeof(char), fileSize, file);
    return buffer;
}

This is the code that I use to calculate the edit distance:

int calculateDistance(char *pathFile1, char *pathFile2, int choice, char *path) {
    FILE *f1 = fopen(pathFile1, "rb");
    FILE *f2 = fopen(pathFile2, "rb");
    char *contentFile1 = readFromBinary(pathFile1);
    char *contentFile2 = readFromBinary(pathFile2);

    int distance = 0;
    int dim1 = getFileSize(f1);
    int dim2 = getFileSize(f2);

    int **matrix = constructMatrix(dim1, dim2);
    fillMatrix(matrix, dim1, dim2, contentFile1, contentFile2);

    distance = matrix[dim1][dim2];
    struct Instruction instruction[distance + 1];

    int initActions = initInstructions(matrix, pathFile1, &dim1, pathFile2, &dim2, instruction);
    endInstructions(pathFile1, &dim1, pathFile2, &dim2, instruction, initActions);

    if (choice == 1)
        printOnFile(instruction, distance, path);

    for (int i = 0; i <= dim1; i++)
        free(matrix[i]);
    free(matrix);

    if (numberOfDivisions > 0)
        numberOfDivisions--;

    return distance;
}

And this is the code that I use to create and fill the matrix:

int **constructMatrix(int dim1, int dim2) {
    // array of row pointers
    int **matrice = (int **) malloc((dim1 + 1) * sizeof(int *));

    // one row of dim2 + 1 ints per pointer
    for (int i = 0; i <= dim1; i++)
        matrice[i] = (int *) malloc((dim2 + 1) * sizeof(int));

    return matrice;
}

void fillMatrix(int **matrix, int dim1, int dim2, char *file1, char *file2) {
    for (int i = 0; i <= dim1; i++)
        matrix[i][0] = i;
    for (int j = 1; j <= dim2; j++)
        matrix[0][j] = j;
    for (int i = 1; i <= dim1; i++) {
        for (int j = 1; j <= dim2; j++) {
            if (file1[i - 1] != file2[j - 1]) {
                int k = minimum(matrix[i][j - 1], matrix[i - 1][j], matrix[i - 1][j - 1]);
                matrix[i][j] = k + 1;
            } else
                matrix[i][j] = matrix[i - 1][j - 1];
        }
    }
}

In particular, the debugger stops at this line of calculateDistance: fillMatrix(matrix, dim1, dim2, contentFile1, contentFile2); and at this line of fillMatrix: matrix[i][0] = i; when i = 1354.

Information about the PDF:

The PDF file is 188671 bytes.

It has 1355 lines.

PS. My program works with txt files.

Recommended answer

When any of the memory allocation functions, including malloc, calloc, and realloc, asks the OS for memory, the function returns NULL unless the OS can find a single contiguous block of the requested size. Since the matrix needs one row per byte of the first file (188671 + 1 rows), each holding one int per byte of the second file, the total request is enormous and an allocation is likely to fail.
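
To put a number on that, here is a minimal sketch that computes the space a full matrix would need. The helper name matrixBytes and its elemSize parameter are illustrative and not part of the question's code. The figure in the note below additionally assumes that both files are about as large as the one PDF size quoted in the question and that int is 4 bytes.

```c
#include <stdint.h>

/* Hypothetical helper (not in the question): bytes needed for the int
   cells of a (dim1 + 1) x (dim2 + 1) matrix, ignoring the row-pointer
   array and any allocator overhead. */
uint64_t matrixBytes(uint64_t dim1, uint64_t dim2, uint64_t elemSize) {
    uint64_t rows = dim1 + 1;   /* one row per byte of file 1, plus row 0 */
    uint64_t cols = dim2 + 1;   /* one column per byte of file 2, plus column 0 */
    return rows * cols * elemSize;
}
```

Under those assumptions, matrixBytes(188671, 188671, 4) gives 142388494336 bytes, roughly 142 GB, with each row alone taking 754688 bytes. Requests of that order cannot all succeed on a typical machine, so at some row malloc returns NULL.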

It is always recommended to test the return value of any of these functions before attempting to use it:

char *buffer = malloc((fileSize + 1) * sizeof(char));
if (!buffer)
{
    // handle error
}
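
The same check can be applied to the matrix allocation in the question. The following is only a sketch: constructMatrixChecked is a hypothetical name, and returning NULL after freeing the rows already allocated is one possible error policy among several.

```c
#include <stdlib.h>

/* Hypothetical variant of the question's constructMatrix that checks every
   allocation. Returns NULL on failure; rows allocated so far are freed. */
int **constructMatrixChecked(int dim1, int dim2) {
    if (dim1 < 0 || dim2 < 0)
        return NULL;

    /* array of row pointers */
    int **matrix = malloc(((size_t)dim1 + 1) * sizeof *matrix);
    if (matrix == NULL)
        return NULL;

    for (int i = 0; i <= dim1; i++) {
        matrix[i] = malloc(((size_t)dim2 + 1) * sizeof **matrix);
        if (matrix[i] == NULL) {
            /* undo the partial allocation before reporting failure */
            for (int k = 0; k < i; k++)
                free(matrix[k]);
            free(matrix);
            return NULL;
        }
    }
    return matrix;
}
```

The caller would then test the result before calling fillMatrix and report an out-of-memory error instead of writing through a NULL row pointer.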

And in this case, it would be good to re-evaluate your algorithm.
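
One common way to re-evaluate it, offered here as a sketch rather than as the approach the answer had in mind, is to keep only two rows of the matrix at a time, since each cell depends only on the current and the previous row. The function name editDistanceTwoRows and the (size_t)-1 failure value are assumptions made for this example.

```c
#include <stdlib.h>

/* Hypothetical helper (not in the question): edit distance between two byte
   buffers using two rows of len2 + 1 entries instead of a full matrix.
   Returns (size_t)-1 if an allocation fails. */
size_t editDistanceTwoRows(const unsigned char *a, size_t len1,
                           const unsigned char *b, size_t len2) {
    size_t *prev = malloc((len2 + 1) * sizeof *prev);
    size_t *curr = malloc((len2 + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }

    for (size_t j = 0; j <= len2; j++)
        prev[j] = j;                         /* row 0: j insertions */

    for (size_t i = 1; i <= len1; i++) {
        curr[0] = i;                         /* column 0: i deletions */
        for (size_t j = 1; j <= len2; j++) {
            if (a[i - 1] == b[j - 1]) {
                curr[j] = prev[j - 1];       /* bytes match, no extra cost */
            } else {
                size_t m = prev[j - 1];                /* substitution */
                if (prev[j] < m) m = prev[j];          /* deletion */
                if (curr[j - 1] < m) m = curr[j - 1];  /* insertion */
                curr[j] = m + 1;
            }
        }
        size_t *tmp = prev;                  /* current row becomes previous */
        prev = curr;
        curr = tmp;
    }

    size_t distance = prev[len2];
    free(prev);
    free(curr);
    return distance;
}
```

This brings memory use down to two rows, but it has two limits worth noting. The running time is still proportional to len1 * len2, and only the distance is produced: the list of edit instructions that the question's code builds from the full matrix cannot be recovered from two rows alone.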
