Extract text from pdf using zlib

Question

I am using this function to find text in a PDF file and replace it with other text. The problem is that when I inflate a stream, change the text, and deflate it again, some text or graphics are sometimes missing in the final PDF. Is this an error in my code, or does the zlib library not support this kind of compression?

// Open the PDF source file:
FILE *pdfFile = fopen([sourceFile cStringUsingEncoding:NSUTF8StringEncoding], "rb");

if (pdfFile) {
    // Get the file length:
    int fseekres = fseek(pdfFile, 0, SEEK_END);

    if (fseekres != 0) {
        fclose(pdfFile);
        return nil;
    }

    long filelen = ftell(pdfFile);
    fseekres = fseek(pdfFile, 0, SEEK_SET);

    if (fseekres != 0) {
        fclose(pdfFile);
        return nil;
    }

    char *buffer = new char[filelen];
    size_t actualread = fread(buffer, filelen, 1, pdfFile);

    if (actualread != 1) {
        fclose(pdfFile);
        return nil;
    }

    bool morestreams = true;

    while (morestreams) {
        size_t streamstart = [self findStringInBuffer:buffer search:(char *)"stream" buffersize:filelen];
        size_t streamend = [self findStringInBuffer:buffer search:(char *)"endstream" buffersize:filelen];

        [self saveFile:buffer len:streamstart + 7 fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];

        if (streamstart > 0 && streamend > streamstart) {
            streamstart += 6;

            if (buffer[streamstart] == 0x0d && buffer[streamstart + 1] == 0x0a) {
                streamstart += 2;
            } else if (buffer[streamstart] == 0x0a) {
                streamstart++;
            }

            if (buffer[streamend - 2] == 0x0d && buffer[streamend - 1] == 0x0a) {
                streamend -= 2;
            } else if (buffer[streamend - 1] == 0x0a) {
                streamend--;
            }

            size_t outsize = (streamend - streamstart) * 10;
            char *output = new char[outsize];

            z_stream zstrm;
            zstrm.zalloc = Z_NULL;
            zstrm.zfree = Z_NULL;
            zstrm.opaque = Z_NULL;
            zstrm.avail_in = (uint)(streamend - streamstart + 1);
            zstrm.avail_out = (uint)outsize;
            zstrm.next_in = (Bytef *)(buffer + streamstart);
            zstrm.next_out = (Bytef *)output;

            int rsti = inflateInit(&zstrm);

            if (rsti == Z_OK) {
                int rst2 = inflate(&zstrm, Z_FINISH);
                inflateEnd(&zstrm);

                if (rst2 >= 0) {
                    size_t totout = zstrm.total_out;

                    //search and replace text code here

                    size_t coutsize = (streamend - streamstart + 1) * 10;
                    char *coutput = new char[coutsize];

                    z_stream c_stream;
                    c_stream.zalloc = Z_NULL;
                    c_stream.zfree = Z_NULL;
                    c_stream.opaque = Z_NULL;
                    c_stream.total_out = 0;
                    c_stream.avail_in = (uint)totout;
                    c_stream.avail_out = (uint)coutsize;
                    c_stream.next_in = (Bytef *)output;
                    c_stream.next_out = (Bytef *)coutput;

                    rsti = deflateInit(&c_stream, Z_DEFAULT_COMPRESSION);

                    if (rsti == Z_OK) {
                        rsti = deflate(&c_stream, Z_FINISH);
                        deflateEnd(&c_stream);

                        if (rsti >= 0) {
                            [self saveFile:coutput len:c_stream.total_out fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
                        }
                    }

                    delete [] coutput; coutput = 0;
                    [self saveFile:(char *)"\nendstr" len:7 fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
                }
            }

            delete[] output; output = 0;
            buffer += streamend + 7;
            filelen = filelen - (streamend + 7);
        } else {
            morestreams = false;
        }
    }

    [self saveFile:buffer len:filelen fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
}

fclose(pdfFile);

Recommended answer

There are multiple issues in your code the effects of which are visible in the sample newpdf.pdf you provided in a comment to Bruno's answer:

  1. After you write your re-compressed stream to the output file, you add "\nendstr" and advance 7 characters (the size of this string) beyond the end of the source stream in the input buffer, most likely to avoid seeing the "stream" in "endstream" as the start of the next stream:

[self saveFile:(char *)"\nendstr" len:7 fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
[...]
buffer += streamend + 7;

The issue in adding that string is that you assume that the "endstream" in the input buffer is preceded by exactly one NEWLINE (0x0A) byte. This assumption is wrong because

a. in PDF there are three types of valid end-of-line markers, a single LINE FEED (0x0A), a single CARRIAGE RETURN (0x0D), or a CARRIAGE RETURN and LINE FEED pair (0x0D 0x0A), and any one of these end-of-line markers may precede the "endstream" in the input buffer; in the code further above where you calculate the end of the compressed stream, you ignore the single CARRIAGE RETURN variety, and here you ignore the 2 byte variety; and furthermore:

b. the PDF specification does not even require, but merely recommends, an end-of-line marker between the end of the stream data and the "endstream" keyword, cf. section 7.3.8.1:

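There should be an end-of-line marker after the data and before endstream; this marker shall not be included in the stream length.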

This already breaks the first stream in your sample file in which the source file does not have an end-of-line marker there and your result, therefore, replaces the original "endstream" with a "\nendstram". This actually happens fairly often in your sample.
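The end-of-line cases from points a and b can be sketched as follows. This is only an illustration: `dataEndBefore` is a hypothetical helper, not part of the original code, and it assumes the position of the "endstream" keyword is already known. It accepts all three end-of-line markers as well as no marker at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index one past the last stream data byte,
// given the index of the 'e' of "endstream" and the index of the first data byte.
// Handles CR LF, a single LF, a single CR, or no end-of-line marker at all.
// Caveat: if the file has no marker and the data itself ends in 0x0D or 0x0A,
// this strips a real data byte. Only the Length entry is authoritative.
std::size_t dataEndBefore(const std::string &buf, std::size_t endstreamPos,
                          std::size_t dataStart) {
    assert(dataStart <= endstreamPos && endstreamPos <= buf.size());
    const std::size_t end = endstreamPos;
    if (end - dataStart >= 2 && buf[end - 2] == '\r' && buf[end - 1] == '\n')
        return end - 2;  // CR LF pair
    if (end - dataStart >= 1 && (buf[end - 1] == '\n' || buf[end - 1] == '\r'))
        return end - 1;  // single LF or single CR
    return end;          // no end-of-line marker before "endstream"
}
```

Whatever marker was found (or not found) should be written back unchanged, together with the complete "endstream" keyword, instead of a fixed "\nendstr".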

You completely ignore that a PDF stream in its dictionary contains an entry containing the length of the stream, cf. section 7.3.8.2 in the PDF specification:

Every stream dictionary shall have a Length entry that indicates how many bytes of the PDF file are used for the stream’s data.

Your manipulation, even if you only decompress and recompress, is likely to change the length of the compressed stream. Thus, you have to update that Length entry. This admittedly makes your task somewhat more difficult as that dictionary is before the stream. Furthermore, in cases like your source file, that entry might even not directly contain the value but instead reference an indirect object somewhere else in the file.

This breaks the second stream in your file which claims it is 8150 bytes long but instead is some 200 bytes longer. Any PDF viewer may assume the content of that stream in your file is only 8150 bytes long and, thus, ignore the contents of those trailing 200 bytes. This may very well be the reason why you observed that

some texts or graphics sometimes are missed.
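Reading the Length entry might look roughly like the sketch below. It assumes the text of the stream dictionary has already been isolated in a string. `LengthEntry` and `parseLength` are hypothetical names introduced for illustration, and the scan is deliberately naive: it ignores comments and strings inside the dictionary.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct LengthEntry {
    bool found = false;     // was a usable /Length key present?
    bool indirect = false;  // true for an indirect reference such as "12 0 R"
    long value = 0;         // byte count, or the object number when indirect
    long generation = 0;    // generation number when indirect
};

static std::size_t skipSpace(const std::string &s, std::size_t i) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    return i;
}

// Reads an unsigned decimal integer at s[i]; advances i past the digits.
static bool readInt(const std::string &s, std::size_t &i, long &out) {
    const std::size_t start = i;
    long v = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
        v = v * 10 + (s[i] - '0');
        ++i;
    }
    if (i == start) return false;
    out = v;
    return true;
}

LengthEntry parseLength(const std::string &dict) {
    LengthEntry r;
    const std::string key = "/Length";
    for (std::size_t p = dict.find(key); p != std::string::npos;
         p = dict.find(key, p + 1)) {
        std::size_t i = p + key.size();
        // The key must be followed by whitespace; this skips names like /Length1.
        if (i >= dict.size() || !std::isspace(static_cast<unsigned char>(dict[i])))
            continue;
        i = skipSpace(dict, i);
        if (!readInt(dict, i, r.value)) continue;
        r.found = true;
        // Look ahead for "<generation> R", which marks an indirect reference.
        std::size_t j = skipSpace(dict, i);
        long gen = 0;
        if (j > i && readInt(dict, j, gen)) {
            const std::size_t k = skipSpace(dict, j);
            if (k > j && k < dict.size() && dict[k] == 'R') {
                r.indirect = true;
                r.generation = gen;
            }
        }
        return r;
    }
    return r;
}
```

For a direct value the number can be rewritten in place after recompression, although a different digit count shifts every following byte. For an indirect reference the referenced object has to be located and updated instead.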

  • You completely ignore that PDFs have a cross-reference table or stream (or possibly even a chain of them), cf. the PDF specification:

    The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object. The table shall contain a one-line entry for each indirect object, specifying the byte offset of that object within the body of the file. (Beginning with PDF 1.5, some or all of the cross-reference information may alternatively be contained in cross-reference streams; see 7.5.8, "Cross-Reference Streams.")

    Your manipulation, even if you only decompress and recompress, is likely to change the length of the compressed stream. Thus, you have to update the offsets of all following objects in the cross reference table.

    Since the size of the second stream in your result file already differs, only very few cross-reference entries in that file are correct.
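The bookkeeping for a classic cross-reference table (not a cross-reference stream) could be sketched like this. `shiftOffsets` and `formatXrefEntry` are hypothetical helpers, and the sketch assumes the object offsets have already been read from the table. Each entry in such a table is exactly 20 bytes long: a 10-digit offset, a 5-digit generation number, the letter n for an in-use object, and a two-character end-of-line.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Shifts every object offset at or beyond editEnd (the original end of the
// rewritten stream) by delta (new stream length minus old stream length).
void shiftOffsets(std::vector<long long> &offsets, long long editEnd,
                  long long delta) {
    for (long long &off : offsets)
        if (off >= editEnd) off += delta;
}

// Formats one in-use entry of a classic cross-reference table (20 bytes).
// Free entries (letter f) store a next-free object number instead of an
// offset and are not covered by this sketch.
std::string formatXrefEntry(long long offset, int generation) {
    assert(offset >= 0 && offset <= 9999999999LL);
    assert(generation >= 0 && generation <= 65535);
    char line[32];
    std::snprintf(line, sizeof line, "%010lld %05d n \n", offset, generation);
    return std::string(line);
}
```

The offset after the startxref keyword at the end of the file has to be shifted in the same way, because the cross-reference table itself moves when an earlier stream changes size.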

    You assume that every PDF stream is deflated. This assumption is wrong, cf. table 5 in the PDF specification.

    Your code essentially drops all streams it cannot inflate. This may also be a reason why you observed that

    some texts or graphics sometimes are missed.
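One way to avoid dropping streams is to check the Filter entry first and copy every stream that is not plainly deflated to the output unchanged. The sketch below assumes the dictionary text is available as a string. `filterNames` and `isPlainFlate` are hypothetical names. To stay on the safe side it also treats any stream with a DecodeParms entry as "do not touch", since predictor parameters would need extra handling.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

static bool isDelimOrSpace(char c) {
    return std::isspace(static_cast<unsigned char>(c)) ||
           std::string("()<>[]{}/%").find(c) != std::string::npos;
}

// Reads a name token starting at the '/' at s[i]; returns it without the slash.
static std::string readName(const std::string &s, std::size_t &i) {
    const std::size_t start = ++i;
    while (i < s.size() && !isDelimOrSpace(s[i])) ++i;
    return s.substr(start, i - start);
}

// Returns the filter names of a stream dictionary, in order; empty if none.
std::vector<std::string> filterNames(const std::string &dict) {
    std::vector<std::string> names;
    const std::string key = "/Filter";
    const std::size_t p = dict.find(key);
    if (p == std::string::npos) return names;
    std::size_t i = p + key.size();
    if (i < dict.size() && !isDelimOrSpace(dict[i])) return names;  // longer name
    while (i < dict.size() && std::isspace(static_cast<unsigned char>(dict[i]))) ++i;
    if (i < dict.size() && dict[i] == '/') {
        names.push_back(readName(dict, i));            // single filter name
    } else if (i < dict.size() && dict[i] == '[') {
        ++i;                                           // array of filter names
        while (i < dict.size() && dict[i] != ']') {
            if (dict[i] == '/') names.push_back(readName(dict, i));
            else ++i;
        }
    }
    return names;
}

// True only for streams with exactly one filter, FlateDecode, and no DecodeParms.
bool isPlainFlate(const std::string &dict) {
    const std::vector<std::string> f = filterNames(dict);
    return f.size() == 1 && f[0] == "FlateDecode" &&
           dict.find("/DecodeParms") == std::string::npos;
}
```

A stream for which `isPlainFlate` returns false, or for which inflate reports an error, would then be written to the output byte for byte instead of being skipped.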

  • You assume that the sequence "stream" in a PDF unambiguously indicates the start of a stream. This is wrong, that sequence may easily be used in other contexts, too.
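A slightly stricter test for the keyword could look like the following sketch. `streamDataStart` is a hypothetical helper. It accepts "stream" only when it directly follows the ">>" that closes a dictionary and is followed by CR LF or a single LF, which is what the PDF specification requires after the keyword. This remains a heuristic: a robust tool would locate objects through the cross-reference table rather than by scanning.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the index of the first stream data byte if the "stream" at pos looks
// like a real stream keyword, or std::string::npos otherwise.
std::size_t streamDataStart(const std::string &buf, std::size_t pos) {
    assert(pos <= buf.size());
    static const std::string kw = "stream";
    if (buf.compare(pos, kw.size(), kw) != 0) return std::string::npos;

    // Walk back over whitespace; the dictionary must end with ">>" before it.
    std::size_t b = pos;
    while (b > 0 && std::isspace(static_cast<unsigned char>(buf[b - 1]))) --b;
    if (b < 2 || buf[b - 1] != '>' || buf[b - 2] != '>') return std::string::npos;

    // The keyword must be followed by CR LF or LF, not by a lone CR.
    const std::size_t a = pos + kw.size();
    if (a + 1 < buf.size() && buf[a] == '\r' && buf[a + 1] == '\n') return a + 2;
    if (a < buf.size() && buf[a] == '\n') return a + 1;
    return std::string::npos;
}
```

With this check the "stream" inside "endstream" is rejected because it is preceded by "end" rather than ">>", so skipping a fixed 7 characters is no longer needed.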

    You assume that the first sequence "endstream" in a PDF after the start of a stream unambiguously indicates the end of that stream. This is wrong, that sequence may also be part of the stream content. You have to use the value of the Length entry in the stream dictionary.
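Once the Length value is known, the end of the data can be computed rather than searched for. The sketch below uses a hypothetical helper, `lengthMatches`, which checks that the "endstream" keyword really follows the data at that position, optionally after one end-of-line marker.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true when "endstream" follows exactly `length` data bytes starting at
// dataStart, optionally preceded by one end-of-line marker (CR LF, LF, or CR).
bool lengthMatches(const std::string &buf, std::size_t dataStart,
                   std::size_t length) {
    assert(dataStart <= buf.size());
    std::size_t i = dataStart + length;
    if (i > buf.size()) return false;
    if (i + 1 < buf.size() && buf[i] == '\r' && buf[i + 1] == '\n')
        i += 2;
    else if (i < buf.size() && (buf[i] == '\n' || buf[i] == '\r'))
        i += 1;
    return buf.compare(i, 9, "endstream") == 0;
}
```

For stream data such as "abc endstream xyz", a plain search for "endstream" stops inside the data, while the Length-based check lands on the real keyword.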

    Furthermore, you seem to assume that every stream you come across is still in use in the resulting PDF. This need not be the case. Especially in the case of incremental updates (cf. section 7.5.6 in the PDF specification), there may be many objects in the file that are no longer in use. While this does not necessarily break the syntax of the result file, your changes (if they depend on each other) are semantically incorrect.
