iTextSharp comparing 2 PDFs for equality


Problem description

I am generating PDFs and storing them in a database.

The PDF data is stored in a text field using Convert.ToBase64String(pdf.ByteArray).

If I generate exactly the same PDF that already exists in the database and compare the two Base64 strings, they are not the same. A large portion matches, but about 5-10% of the text appears to differ each time.

What would make two PDFs different if both were generated using the same method?

This is a problem because I can't tell whether the PDF was modified since it was last saved to the database.

The two PDFs look exactly the same when viewed, but the Base64 strings of their bytes are different.

Recommended answer

Two PDFs that look 100% identical can be completely different under the covers. PDF-producing programs are free to write the word "hello" as a single word or as five individual letters written in any order. They are also free to draw the lines of a table first and the cell contents afterwards, or the cell contents first, or any combination of these, such as one cell at a time.

If you are creating the PDFs programmatically and you create two PDFs using completely identical code, you still won't get files that are 100% identical. There are a couple of reasons for this. The most obvious is that PDFs support creation and modification dates, which change depending on when the file is created. You can override these (and confuse everyone else, so I don't recommend it) using something like this:

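// writer is the PdfWriter instance used to create the document
var info = writer.Info;
// Pin both dates to a fixed value so they no longer vary between runs
info.Put(PdfName.CREATIONDATE, new PdfDate(new DateTime(2001, 1, 1)));
info.Put(PdfName.MODDATE, new PdfDate(new DateTime(2001, 1, 1)));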

However, PDFs also support a unique identifier in the trailer's /ID entry. To the best of my knowledge, iText has no support for overriding this entry. You could duplicate your PDF, change this value manually, and then calculate your differences, which might get you closer to a comparison.

Then there are fonts. When subsetting fonts, producers create a unique internal name based on the original name and an arbitrary selection of six uppercase ASCII letters. So for the font Calibri, the font's name could be JLXWHD+Calibri one time and SDGDJT+Calibri another time. iText doesn't support overriding this because you'd probably do more harm than good. These internal names are used to avoid font subset collisions.
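To make the naming scheme concrete, here is a minimal sketch that strips the six-letter subset tag from a font name so that two subset names can be compared. It uses only the .NET regex classes, not iText; the FontNames class and the StripSubsetTag method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontNames
{
    // A subset tag is exactly six uppercase ASCII letters followed by '+'.
    private static readonly Regex SubsetTag =
        new Regex(@"^[A-Z]{6}\+", RegexOptions.CultureInvariant);

    // Returns the font name without a leading subset tag.
    // Names that carry no tag are returned unchanged.
    public static string StripSubsetTag(string fontName)
    {
        return SubsetTag.Replace(fontName, string.Empty);
    }
}
```

With this helper, FontNames.StripSubsetTag("JLXWHD+Calibri") and FontNames.StripSubsetTag("SDGDJT+Calibri") both yield "Calibri", so the two names compare as equal even though the raw names differ.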

So the short answer is that unless you are comparing two files that are physical duplicates of each other, you can't perform a direct comparison of their binary contents. The long answer is that you can tweak some of the PDF entries to remove the unique parts for comparison only, but you'd probably be doing more work than it would take to just re-store the file in the database.
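As a rough illustration of the "long answer", the sketch below blanks the three varying parts mentioned above (the dates, the trailer /ID, and the font subset tags) in a copy of the bytes and hashes the result. It rests on several assumptions: the dates are written as literal strings in parentheses, the /ID array holds two hex strings, and those entries are not inside compressed object streams. Anything inside compressed streams or XMP metadata is not touched, so two files can still hash differently. PdfFingerprint and Compute are hypothetical names, and the output is only useful for comparison; it is not a valid PDF.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfFingerprint
{
    // ISO-8859-1 maps every byte to exactly one char, so the round trip is lossless.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // /CreationDate (...) and /ModDate (...) written as literal strings.
    private static readonly Regex Dates = new Regex(
        @"/(CreationDate|ModDate)\s*\((?:\\.|[^\\)])*\)", RegexOptions.Singleline);

    // /ID [<hex><hex>] in the trailer.
    private static readonly Regex TrailerId = new Regex(
        @"/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]");

    // Six uppercase letters plus '+' at the start of a PDF name, e.g. /JLXWHD+Calibri.
    private static readonly Regex SubsetTag = new Regex(@"/[A-Z]{6}\+");

    // Returns a Base64 SHA-256 hash of the bytes with the varying parts blanked.
    public static string Compute(byte[] pdfBytes)
    {
        string text = Latin1.GetString(pdfBytes);
        text = Dates.Replace(text, "/$1()");
        text = TrailerId.Replace(text, "/ID[]");
        text = SubsetTag.Replace(text, "/");

        using (SHA256 sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(Latin1.GetBytes(text)));
        }
    }
}
```

One possible use is to store PdfFingerprint.Compute(bytes) next to the Base64 text and compare fingerprints instead of the full strings. A mismatch then means either a real change or a difference this sketch does not normalize, so treat a match as the stronger signal.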
