Detect if PDF file is correct (header PDF)


Question

I have a Windows .NET application that manages many PDF files. Some of the files are corrupt.

Two issues (I'll try to explain in my imperfect English, sorry):

1.)

How can I detect whether a PDF file is correct?

I want to read the header of the PDF and detect whether it is correct.

var okPDF = PDFCorrect(@"C:\temp\pdfile1.pdf");

2.)

How can I tell whether the byte[] (byte array) of a file is a PDF file or not?

For example, for ZIP files you can examine the first four bytes and see whether they match the local header signature, which in hex is

50 4b 03 04

if (buffer[0] == 0x50 && buffer[1] == 0x4b && buffer[2] == 0x03 && buffer[3] == 0x04)

If you load those four bytes into a long (read as a little-endian 32-bit value), this is 0x04034b50. (by David Pierson)

I want the same for PDF files.

byte[] dataPDF = ...

var okPDF = PDFCorrect(dataPDF);

Is there any sample source code in .NET?

Recommended answer

a. Unfortunately, there is no easy way to determine whether a PDF file is corrupt. Problem files usually have a correct header, so the real cause of the corruption lies elsewhere. A PDF file is effectively a dump of PDF objects. The file contains a cross-reference table that gives the exact byte offset of each object from the start of the file. So corrupted files most probably have broken offsets, or some objects are missing.
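To make the point about broken offsets concrete, here is a minimal sketch of one structural check. It relies on the fact that a PDF normally ends with a `startxref` keyword followed by the byte offset of the cross-reference section. The class and method names (`PdfTrailerCheck`, `HasPlausibleXrefOffset`) and the 1024-byte tail window are assumptions made for this illustration. A `false` result suggests damage, but a `true` result does not prove the file is valid, because the individual object offsets are not verified.

```csharp
using System;
using System.Text;

public static class PdfTrailerCheck
{
    // Hypothetical helper: returns true when the offset written after the last
    // "startxref" keyword lies inside the data and points at either the keyword
    // "xref" (classic table) or a digit (start of "N G obj", i.e. a
    // cross-reference stream as used since PDF 1.5).
    public static bool HasPlausibleXrefOffset(byte[] data)
    {
        if (data == null || data.Length == 0) return false;

        // Assumption: the trailer sits in the last 1024 bytes.
        int tailStart = Math.Max(0, data.Length - 1024);
        // ASCII decoding yields one char per byte (non-ASCII bytes become '?'),
        // so string indexes match byte positions within the tail.
        string tail = Encoding.ASCII.GetString(data, tailStart, data.Length - tailStart);

        int marker = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (marker < 0) return false;

        // Skip whitespace after the keyword, then collect the decimal digits.
        int pos = marker + "startxref".Length;
        while (pos < tail.Length && char.IsWhiteSpace(tail[pos])) pos++;
        int digitsStart = pos;
        while (pos < tail.Length && tail[pos] >= '0' && tail[pos] <= '9') pos++;
        if (pos == digitsStart) return false;

        long offset;
        if (!long.TryParse(tail.Substring(digitsStart, pos - digitsStart), out offset)) return false;
        if (offset < 0 || offset >= data.Length) return false;

        int o = (int)offset;
        // Cross-reference stream: the offset points at an object number.
        if (data[o] >= (byte)'0' && data[o] <= (byte)'9') return true;
        // Classic cross-reference table: the offset points at "xref".
        return o + 4 <= data.Length
            && data[o] == (byte)'x' && data[o + 1] == (byte)'r'
            && data[o + 2] == (byte)'e' && data[o + 3] == (byte)'f';
    }
}
```

A full validation would also follow every entry of the table and confirm that an object starts at each listed offset, which is the work a PDF library does when it loads the file.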

The best way to detect a corrupted file is to use a specialized PDF library. There are many free and commercial PDF libraries for .NET. You can simply try to load the PDF file with one of them. iTextSharp would be a good choice.
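As a rough sketch of the "try to load it" approach, the code below uses the `PdfReader` class from iTextSharp 5 (namespace `iTextSharp.text.pdf`), which has constructors taking a file name or a byte array. The wrapper class `PdfLoadCheck` and its `CanLoad` methods are hypothetical names chosen for this example. Catching every exception and treating it as "not loadable" is a simplifying assumption, and a library may quietly repair some damage, so a successful load does not guarantee an undamaged file.

```csharp
using System;
using iTextSharp.text.pdf;

public static class PdfLoadCheck
{
    // Hypothetical wrapper: true when iTextSharp can open the file
    // and reports at least one page.
    public static bool CanLoad(string path)
    {
        PdfReader reader = null;
        try
        {
            reader = new PdfReader(path);
            return reader.NumberOfPages > 0;
        }
        catch (Exception)
        {
            // Simplification: any failure while parsing counts as "cannot load".
            return false;
        }
        finally
        {
            if (reader != null) reader.Close();
        }
    }

    // Same check for a PDF that is already in memory.
    public static bool CanLoad(byte[] data)
    {
        PdfReader reader = null;
        try
        {
            reader = new PdfReader(data);
            return reader.NumberOfPages > 0;
        }
        catch (Exception)
        {
            return false;
        }
        finally
        {
            if (reader != null) reader.Close();
        }
    }
}
```

With this in place, the calls from the question would look like `var okPDF = PdfLoadCheck.CanLoad(dataPDF);`.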

b. According to the PDF reference, the header of a PDF file usually looks like %PDF-1.X (where X is a digit, currently from 0 to 7), and about 99% of PDF files have such a header. However, Acrobat Viewer accepts some other kinds of headers, and even the absence of a header isn't a real problem for PDF viewers. So you shouldn't treat a file as corrupted just because it does not contain a header. For example, the header may appear somewhere within the first 1024 bytes of the file, or it may take the form %!PS-Adobe-N.n PDF-M.m
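For the byte-signature test asked about in the question, a minimal sketch could search the first 1024 bytes for the marker `%PDF-`. The names `PdfHeaderCheck` and `LooksLikePdf` are made up for this example. The sketch does not recognize the `%!PS-Adobe-N.n PDF-M.m` form, and, as explained above, a matching header says nothing about whether the rest of the file is intact.

```csharp
using System;
using System.IO;

public static class PdfHeaderCheck
{
    // "%PDF-" as raw bytes.
    private static readonly byte[] Marker = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // The header is only searched for within this many leading bytes.
    private const int SearchWindow = 1024;

    // True when "%PDF-" occurs entirely within the first 1024 bytes of the data.
    public static bool LooksLikePdf(byte[] data)
    {
        if (data == null) return false;
        int limit = Math.Min(data.Length, SearchWindow);
        for (int i = 0; i + Marker.Length <= limit; i++)
        {
            int j = 0;
            while (j < Marker.Length && data[i + j] == Marker[j]) j++;
            if (j == Marker.Length) return true;
        }
        return false;
    }

    // Reads at most the first 1024 bytes of the file and applies the same test.
    public static bool LooksLikePdf(string path)
    {
        byte[] buffer = new byte[SearchWindow];
        int total = 0;
        using (FileStream stream = File.OpenRead(path))
        {
            int read;
            // Stream.Read may return fewer bytes than requested, so loop.
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
        }
        byte[] head = new byte[total];
        Array.Copy(buffer, head, total);
        return LooksLikePdf(head);
    }
}
```

This check is best used as a cheap first filter, for example to reject a ZIP file that was renamed to .pdf, before a file is handed to a real PDF library.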

For your information, I am a developer of the Docotic PDF library.
