Fastest way to check that a PDF is corrupted (or just missing EOF) in Ruby?

Question

I am looking for a way to check if a PDF is missing its end-of-file marker. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check whether it ends with an EOF marker. I need to process lots of potentially large PDFs and I want to load as little into memory as possible.

Note: all the files I want to detect will be lacking the EOF marker, so I feel like this is a slightly more specific scenario than detecting general PDF "corruption". What is the best, fastest way to do this?

Solution

TL;DR

Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
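The last-6-or-7-bytes idea can be sketched as follows. This is a minimal illustration rather than a tested production check: the method name ends_with_eof_marker? is hypothetical, and accepting CRLF, a bare CR, or no end-of-line at all after the marker is my assumption about what real-world files contain. It only confirms that the file ends with the marker; it does not confirm that the marker starts its own line.

```ruby
# Hypothetical helper: true when the file ends with %%EOF plus an optional end-of-line.
def ends_with_eof_marker?(filename)
  File.open(filename, 'rb') do |f|
    # "%%EOF\r\n" is 7 bytes, so 7 bytes is the most this check ever needs to read.
    tail_length = [f.size, 7].min
    f.seek(-tail_length, IO::SEEK_END)
    # Assumption: allow LF, CRLF, bare CR, or nothing after the marker.
    f.read.match?(/%%EOF(?:\r\n|\r|\n)?\z/)
  end
end
```

Because only a handful of bytes are read per file, memory use stays constant no matter how large the PDF is.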

Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.

Check Last Kilobyte for File Trailer

This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:

Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

Therefore, the following will work by looking for the file trailer instruction within that range:

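def valid_file_trailer?(filename)
  File.open(filename, 'rb') do |f|
    f.seek(-[f.size, 1024].min, IO::SEEK_END) # clamp: seeking past the start of a file under 1 KiB raises Errno::EINVAL
    f.read.include?('%%EOF')
  end
end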

A Stricter Check of the File Trailer via Regex

However, the ISO standard is both more complex and a lot more strict. It says, in part:

The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).

Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:

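def valid_file_trailer?(filename)
  # Anchored to the end of the data: startxref, a byte offset, then %%EOF on the final line.
  pattern = /^startxref\n\d+\n%%EOF\n\z/
  # scrub replaces invalid byte sequences so the match doesn't raise on binary content.
  File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end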
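If reading the whole file is too expensive for very large PDFs, the same pattern can be applied to the tail only. The sketch below is an assumption-laden variant, not part of the approach above: the name strict_file_trailer? and the tail_bytes parameter are hypothetical, and it deliberately relaxes the pattern to accept CRLF or bare CR line endings and a missing final end-of-line.

```ruby
# Hypothetical helper: applies the startxref / offset / %%EOF pattern to the file's tail only.
def strict_file_trailer?(filename, tail_bytes = 1024)
  # Assumption: treat LF, CRLF, and bare CR as end-of-line markers.
  eol = /\r\n|\r|\n/
  pattern = /(?:\A|#{eol})startxref#{eol}\d+#{eol}%%EOF(?:#{eol})?\z/
  File.open(filename, 'rb') do |f|
    # Clamp the offset so files shorter than tail_bytes are read from the start.
    f.seek(-[f.size, tail_bytes].min, IO::SEEK_END)
    # Binary mode yields raw bytes, so no scrub is needed before matching.
    f.read.match?(pattern)
  end
end
```

Like the regex above, this still does not verify the trailer dictionary or that the offset actually points at an xref section; it only checks the shape of the last three lines.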
