cs50 pset4 Recover - why does writing entire file to memory fail check50?

Problem description

I'm working on Recover and am wondering if there's something fundamentally flawed with my approach. The walkthrough suggests using fread() to go through a file in 512 byte chunks, look for JPEG headers, and write to a new file - 00X.jpg - each time one is found. What I tried instead was creating an arbitrarily large temporary buffer with malloc(), using fread()'s return value to determine the size of the file, and writing the entirety of the file to an array of structs with two data types; .B for BYTE, to store the file, and .header for bool, to indicate where each JPEG header begins.

I'm running into two problems. One is that recovered images don't pass check50, and two is that trying to write more than one byte at a time from my array results in garbage bytes. Here's what I'm doing:

typedef uint8_t BYTE;
typedef struct
{
    BYTE B;
    bool header;
}
images;

This defines the data type BYTE and my struct using both bytes and bools.

BYTE *tmp_buffer = malloc(4000000 * sizeof(BYTE));
int counter = fread(tmp_buffer, sizeof(BYTE), 4000000, file);
images buffer[counter];

This creates the arbitrarily large buffer with malloc(), uses it and the return value of fread to determine the byte size of the file, and then creates a buffer in memory to work with.

for (int copy = 0; copy < counter; copy++)
{
    buffer[copy].header = false;
    buffer[copy].B = tmp_buffer[copy];
}
free(tmp_buffer);
fclose(file);
for (int check = 0; check < counter; check++)
{
    if (buffer[check].B == 0xff && buffer[check + 1].B == 0xd8 && buffer[check + 2].B == 0xff)
    {
        buffer[check].header = true;
    }
}

This copies every byte from the 'temporary' buffer to the permanent one, sets all of the headers to false, and then closes the file/frees the memory. Afterwards, it finds the JPEG headers and sets them to true. From here is me experimenting to see what works:

int headers_counter = 1;
for (int header_location = 0; header_location < counter; header_location+= 512)
{
    if (buffer[header_location].header == true)
    {
        printf("%i. %i\n", headers_counter, header_location);
        headers_counter++;
    }
}

This prints the number and array (not byte) position of every header in the original file, and it appears to work. I say 'appears' because the following code does recover an image:

int file_number = 0;
char file_name[8];
sprintf(file_name, "%03i.jpg", file_number);
FILE *img = fopen(file_name, "w");
for (int i = 1024; i < 115200; i++)
{
    fwrite(&buffer[i].B, sizeof(BYTE), 1, img);
}

This is not intended to solve the entire problem, i.e. recover all 50 images. It's only intended to recover 000.jpg by beginning at the first byte of 000.jpg's header and ending at the last byte before 001.jpg's header (edit: it's a hard-coded example using the header locations printed to the terminal above, also an example). It appears to do so, but it fails check50 with the error "recovered image does not match."

My girlfriend is also taking the class, and she implemented her code the way the walkthrough suggests. We opened our 000.jpg files in hex output and compared. We didn't go through every byte, but the first and last few dozen rows appeared to be identical, slack space and all.

The other thing I mentioned is garbage characters when writing more than one byte at a time. If I change my final loop to this:

for (int i = 1024; i < 115200; i+= 512)
{
    fwrite(&buffer[i].B, sizeof(BYTE), 512, img);
}

Then it works even less and 000.jpg says it's an invalid or unsupported image format. I looked at the hex output, and this is what I see when comparing the first row of my original loop and the one above that increments by 512:

ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01
ff 01 d8 00 ff 00 e0 00 00 00 10 00 4a 00 46 00

There's an extra byte in every other position! I'm at a loss here. At this point, it's more about understanding these behaviors. I'm sure there's a logical explanation for both, but it's driving me crazy! I tried doing an array of bytes instead of the struct with a bool added, and it did the same thing.

Recommended answer

As mentioned in the comments above, by trying to use a struct and by trying to store each jpg so it can be written out all at once, you are making things much harder than they need to be. As the directions discuss, the FAT filesystem (which was on the card the images were taken from) stores chunks of each file in 512-byte sectors. To scan the card, all you need is a 512-byte buffer to handle the read and the immediate write to the output file. No structs are needed, and there is no need to dynamically allocate memory.

The way to approach the read is to read each 512-byte block of data from the file. You then need to check whether the first 4 bytes of the block hold the jpg header. A short function to do that test for you could be written as:

#include <stdio.h>
#include <stdlib.h>

#define FNLEN 128       /* if you need a constant, #define one (or more) */
#define BLKSZ 512

/* check if first 4-bytes in buf match jpg header */
int chkjpgheader (const unsigned char *buf)
{
    return  buf[0] == 0xff && 
            buf[1] == 0xd8 && 
            buf[2] == 0xff && 
            buf[3] >> 4 == 0xe;
}

(you simply test if each condition is true returning the result of the conditional)
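
As a quick sanity check, the same test can be run on the two hex rows from the question. The sketch below is only an illustration: the name is_jpg_header and the two sample arrays are hypothetical, and the mask form (buf[3] & 0xf0) == 0xe0 is just an equivalent way of writing buf[3] >> 4 == 0xe.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical helper: true if the first 4 bytes look like a jpg header */
bool is_jpg_header (const uint8_t *buf)
{
    return buf[0] == 0xff &&
           buf[1] == 0xd8 &&
           buf[2] == 0xff &&
           (buf[3] & 0xf0) == 0xe0;     /* high nibble 0xe: 0xe0 .. 0xef */
}

/* first row of the working 000.jpg from the question */
const uint8_t good_row[] = { 0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10, 0x4a, 0x46 };

/* first row of the broken 000.jpg (written 512 elements at a time) */
const uint8_t bad_row[]  = { 0xff, 0x01, 0xd8, 0x00, 0xff, 0x00, 0xe0, 0x00 };
```

is_jpg_header (good_row) is true while is_jpg_header (bad_row) is false, which is consistent with the image viewer rejecting the second file.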

Thinking about how to handle scanning for jpg headers and reading the file, you can do it all in a single loop that reads 512 bytes from the input while keeping a counter of the number of jpg headers found, which you also use as a flag to indicate that a header has been found. You read a block of data and test whether it starts with a header. If it does, and it is not the first header, close the output file for the previous jpg; then create a new filename and open the file (validating each step). Once a header has been found, write each block out to the current file as you loop, checking the start of each 512-byte block for the header signature. Repeat until you run out of file.

You could implement something similar to:

/* find each jpg header and write contents to separate file_000x.jpg files.
 * returns the number of jpg files successfully recovered.
 */
int recoverjpgs (FILE *ifp)
{
    char jpgname[FNLEN] = "";       /* jpg output filename */
    unsigned char buf[BLKSZ];       /* read buffer */
    int jpgcnt = 0;                 /* found jpg header count*/
    size_t nbytes;                  /* no. of bytes read/written */
    FILE *fp = NULL;                /* FILE* pointer for jpg output */
    
    /* read each 512-byte block until end of input */
    while ((nbytes = fread (buf, 1, BLKSZ, ifp)) > 0) {
        /* check if jpg header found */
        if (nbytes >= 4 && chkjpgheader(buf)) {
            /* if not 1st header, close current file */
            if (jpgcnt) {
                if (fclose (fp) == EOF) {   /* validate every close-after-write */
                    perror ("recoverjpg()-fclose");
                    return jpgcnt - 1;
                }
            }
            /* create output filename (e.g. file_0001.jpg) */
            sprintf (jpgname, "file_%04d.jpg", jpgcnt + 1);
            /* open next file/validate file open for writing */
            if ((fp = fopen (jpgname, "wb")) == NULL) {
                perror ("fopen-outfile");
                return jpgcnt;
            }
            jpgcnt += 1;    /* increment recovered jpg count */
        }
        /* if header found - write block in buf to output file */
        if (jpgcnt && fwrite (buf, 1, nbytes, fp) != nbytes) {
            perror ("recoverjpg()-fwrite");
            return jpgcnt - 1;
        }
    }
    /* if file opened, close final file */
    if (jpgcnt && fclose (fp) == EOF) {     /* validate every close-after-write */
        perror ("recoverjpg()-fclose");
        return jpgcnt - 1;
    }
    
    return jpgcnt;  /* return number of jpg files recovered */
}

(note: jpgcnt is used both as a counter and a flag to control when the first fclose() on a jpg file occurs and to control when the first write to the first file occurs.)

Look at the returns. Understand why jpgcnt or jpgcnt - 1 is being returned at different places in the function. Also understand why you always check the return of fclose() after a write has taken place. There are a number of errors that can occur when the final data is flushed to the file and the file is closed, which would not be caught by checking the last write. So the rule is: always validate close-after-write. There is no need for the check when closing your input file.

That's all you need. In main() you will open the input file and simply pass the open filestream to the recoverjpgs() function saving the return to know how many jpg files were successfully recovered. It can be as simple as:

int main (int argc, char **argv) {
    
    FILE *fp = NULL;            /* input file stream pointer */
    int jpgcnt = 0;             /* count of jpg files recovered */
    
    if (argc < 2 ) {    /* validate 1 argument given for filename */
        fprintf (stderr, "error: insufficient input,\n"
                         "usage: %s filename\n", argv[0]);
        return 1;
    }
    
    /* open file/validate file open for reading */
    if ((fp = fopen (argv[1], "rb")) == NULL) {
        perror ("fopen-argv[1]");
        return 1;
    }
    
    if ((jpgcnt = recoverjpgs(fp)))
        printf ("recovered %d .jpg files.\n", jpgcnt);
    else
        puts ("no jpg files recovered.");
        
    fclose (fp);
}

That is the complete program, just copy/paste the 3-pieces together and give it a try.

Example Use/Output

$ ./bin/recover ~/doc/c/cs50/recover/card.raw
recovered 50 .jpg files.

(the 50 files, file_0001.jpg to file_0050.jpg, will be created in the current directory, and you can enjoy the balloons, flowers, girls, etc. shown in the jpg files.)
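
One thing to keep in mind before submitting: the question names its output with "%03i.jpg" starting at 000.jpg, while the code above writes file_0001.jpg onward. Below is a small sketch of building names in the question's format. The helper name make_jpg_name is hypothetical, and whether check50 requires exactly this format is an assumption based on the question.

```c
#include <stdio.h>

/* hypothetical helper: build "000.jpg", "001.jpg", ... from a zero-based
 * index. returns 0 on success, -1 if the name does not fit in 'name'. */
int make_jpg_name (char *name, size_t size, int index)
{
    int n = snprintf (name, size, "%03d.jpg", index);

    /* snprintf returns the length it wanted to write, excluding the nul */
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

In recoverjpgs() this would mean using the zero-based jpgcnt in place of jpgcnt + 1 when the filename is created.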

Look things over and let me know if you have further questions.

Edit per comment regarding allocating and storing each file for a single write

Even if you want to buffer each file fully before writing it once, the idea of using a struct with a single uint8_t (byte) and a bool to flag whether that struct is a header byte doesn't make much sense. Why? It makes a mess of the write routine, which would have to check every struct in an allocated block large enough to hold the entire card.raw file to catch the 4-struct sequence where each struct has its bool flag set true, essentially duplicating all the testing that was done during the read to find the header bytes and set the bool struct member true to begin with.
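
The struct is also the likely source of the "extra byte in every other position" from the question. Assuming bool occupies one byte, as it does on common platforms, each images element is two bytes, so passing &buffer[i].B to fwrite() with a count of 512 writes the B and header members interleaved (the 01 after the first ff would be the header flag). A minimal sketch of the layout follows, with the hypothetical helper copy_image_bytes() showing one way to pull out just the data bytes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t BYTE;
typedef struct
{
    BYTE B;
    bool header;
}
images;

/* hypothetical helper: copy only the .B member of n structs into dst,
 * giving a contiguous run of bytes that is safe to hand to fwrite() */
void copy_image_bytes (BYTE *dst, const images *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i].B;
}
```

With the array filled from ff d8 ff e0 and header set on the first element, the raw memory of the array reads ff 01 d8 00 ff 00 e0 00 when sizeof (images) is 2, which matches the second hex row in the question, while copy_image_bytes() yields ff d8 ff e0.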

As mentioned, if there were zillions of files, you would want to scan through the input stream from card.raw and save the bytes for each jpg in your buffer so that they could be written to the file in one go while the process continues (you could even fork the write to a separate process so the read could continue without waiting for the write, if you really wanted to tweak things).

Regardless, the approach will be the same. If you dynamically allocate buf, you can fill it with each jpg file and, when the next header is found, write the current contents of buf up to the beginning of the next header to your file, then move the header block just read to the start of buf, and repeat until you run out of input to check.

You reuse the allocated storage for buf throughout the process, expanding it only if the current file requires more storage than is currently allocated (so buf ends up sized to hold the largest jpg found by the end of the run). This minimizes allocations and means the only reallocs required over all 50 files are the ones needed when a larger file is encountered. If the next 20 files all fit within the currently allocated buffer, no adjustment is needed, and you keep filling buf over and over with the different jpg file contents as they are recovered from the "forensic image" (sounds important).
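
The grow-only-when-needed idea can be isolated in a small helper. This is a sketch under assumptions, not part of the program below: the names ensure_capacity and GROWSZ are hypothetical, and the fixed 32K step simply mirrors the increment used in the full listing.

```c
#include <stdint.h>
#include <stdlib.h>

#define GROWSZ (1 << 15)    /* hypothetical fixed growth step of 32K */

/* hypothetical helper: make sure *buf can hold at least 'needed' bytes,
 * growing in GROWSZ steps. returns 1 on success, 0 on failure, in which
 * case *buf and *bufsz are unchanged and *buf is still valid. */
int ensure_capacity (uint8_t **buf, size_t *bufsz, size_t needed)
{
    size_t newsz = *bufsz;

    if (needed <= newsz)                /* already large enough, no realloc */
        return 1;

    if (needed > SIZE_MAX - GROWSZ)     /* guard against size_t overflow */
        return 0;

    while (newsz < needed)              /* grow in fixed steps */
        newsz += GROWSZ;

    void *tmp = realloc (*buf, newsz);  /* temporary pointer, never buf */
    if (!tmp)
        return 0;

    *buf = tmp;
    *bufsz = newsz;

    return 1;
}
```

Starting from a NULL buffer of size 0 also works, since realloc() on a NULL pointer behaves like malloc(). On failure the caller still owns the original block and is responsible for freeing it.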

The only additions are a bufsz variable to track the current allocation size of buf and a total variable to track the total bytes read for each jpg file. Other than that, you are just rearranging where the files are written, so that you wait until one complete jpg has been read into buf before opening the file, writing those bytes, and closing the file immediately after the write (a short function was written to handle that, since it made sense to write a generic, reusable function that writes a given number of bytes from a buffer to a file of a given name).

The complete file could be written as follows.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define FNLEN 128       /* if you need a constant, #define one (or more) */
#define BLKSZ 512
#define JPGSZ (1<<15)   /* 32K initial allocation size */

/* write 'nbytes' from 'buf' to 'fname'. returns number of bytes
 * written on success, zero otherwise.
 */ 
size_t writebuf2file (const char *fname, void *buf, size_t nbytes)
{
    FILE *fp = NULL;    /* FILE* pointer for jpg output */
    
    /* open file/validate file open for writing */
    if ((fp = fopen (fname, "wb")) == NULL) {
        perror ("writebuf2file-fopen");
        return 0;
    }
    /* write buffer to file/validate bytes written */
    if (fwrite (buf, 1, nbytes, fp) != nbytes) {
        perror ("writebuf2file()-fwrite");
        fclose (fp);    /* close on error so the stream is not leaked */
        return 0;
    }
    /* close file/validate every close-after-write */
    if (fclose (fp) == EOF) {
        perror ("writebuf2file-fclose");
        return 0;
    }
    
    return nbytes;
}

/* check if first 4-bytes in buf match jpg header */
int chkjpgheader (const unsigned char *buf)
{
    return  buf[0] == 0xff && 
            buf[1] == 0xd8 && 
            buf[2] == 0xff && 
            buf[3] >> 4 == 0xe;
}

/* find each jpg header and write contents to separate file_000x.jpg files.
 * returns the number of jpg files successfully recovered.
 */
int recoverjpgs (FILE *ifp)
{
    char jpgname[FNLEN] = "";                   /* jpg output filename */
    int jpgcnt = 0;                             /* found jpg header count*/
    size_t  nbytes,                             /* no. of bytes read/written */
            bufsz = JPGSZ,                      /* tracks current allocation of buf */
            total = 0;                          /* tracks total bytes in jpg file */
    uint8_t *buf = malloc (JPGSZ);              /* read buffer */
    
    if (!buf) { /* validate every allocation/reallocation */
        perror ("malloc-buf");
        return 0;
    }
    
    /* read each 512-byte block until end of input */
    while ((nbytes = fread (buf + total, 1, BLKSZ, ifp)) > 0) {
        /* check if jpg header found */
        if (nbytes >= 4 && chkjpgheader(buf + total)) {
            /* if not 1st header, write buffer to file, reset for next file */
            if (jpgcnt) {
                /* create output filename (e.g. file_0001.jpg) */
                sprintf (jpgname, "file_%04d.jpg", jpgcnt);
                /* write current buf to file */
                if (!writebuf2file (jpgname, buf, total)) {
                    free (buf);     /* free buffer before error return */
                    return jpgcnt - 1;
                }
                /* move header block to start of buf */
                memmove (buf, buf + total, nbytes);
                total = 0;                  /* reset total for next file */
            }
            jpgcnt += 1;    /* increment recovered jpg count */
        }
        /* if header found - begin accumulating blocks in buf */
        if (jpgcnt)
            total += nbytes;
        /* check if reallocation required before next read */
        if (total + BLKSZ > bufsz) {
            /* add a fixed 32K each time reallocation is required;
             * always realloc to a temporary pointer to prevent memory leak
             * on realloc failure.
             */
            void *tmp = realloc (buf, bufsz + (1 << 15));
            if (!tmp) {                     /* validate every reallocation */
                perror ("realloc-buf");
                free (buf);                 /* original block is still valid */
                return jpgcnt - 1;
            }
            buf = tmp;              /* assign reallocated block to buf */
            bufsz += 1 << 15;       /* update bufsz with new allocation size */
        }
    }
    /* write final buffer to file */
    if (jpgcnt) {
        /* create output filename (e.g. file_0001.jpg) */
        sprintf (jpgname, "file_%04d.jpg", jpgcnt);
        /* write current buf to file */
        if (!writebuf2file (jpgname, buf, total)) {
            free (buf);     /* free buffer before error return */
            return jpgcnt - 1;
        }
    }
    
    free (buf);     /* free allocated memory */
    
    return jpgcnt;  /* return number of jpg files recovered */
}

int main (int argc, char **argv) {
    
    FILE *fp = NULL;            /* input file stream pointer */
    int jpgcnt = 0;             /* count of jpg files recovered */
    
    if (argc < 2 ) {    /* validate 1 argument given for filename */
        fprintf (stderr, "error: insufficient input,\n"
                         "usage: %s filename\n", argv[0]);
        return 1;
    }
    
    /* open file/validate file open for reading */
    if ((fp = fopen (argv[1], "rb")) == NULL) {
        perror ("fopen-argv[1]");
        return 1;
    }
    
    if ((jpgcnt = recoverjpgs(fp)))
        printf ("recovered %d .jpg files.\n", jpgcnt);
    else
        puts ("no jpg files recovered.");
        
    fclose (fp);
}

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.

It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux, valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use: just run your program through one.

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

Take your time and go through the code. Let me know if you have further questions.
