How to compare two PDFs based on visual differences programmatically?

Question

I need to compare two PDF files and find all the visual differences between them. I know there are some related questions on Stack Overflow, but they do not meet my needs.

I'm currently using PDFBox to generate an image for each page of the PDFs and comparing the bytes of those images.

With this approach I can tell that a particular page differs.

But I need finer details, such as the font size of some text. For example: the text "The text" differs on a given page, say page 6, of the PDFs.

This applies not only to text: I need to handle all visual differences, such as images, text in charts, etc.

Please suggest a way to achieve this.

PS: I tried Apache Tika, but my impression is that it is meant for extracting structured text as XHTML, plus metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I'm wrong.

Recommended answer

PDF to image using Java

Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)

https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&;rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw

Good library for converting PDF to TIFF?

Convert jpeg/png to pixel array in Java

int pixel array to bmp in Java

Find pixel position

Get the color of pixels around an image
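The links above point toward rendering each page to an image and comparing pixel arrays rather than raw bytes. Here is a minimal sketch of that idea using only `java.awt.image.BufferedImage`. It assumes the two pages have already been rendered to images (for example with PDFBox). The class name `PageImageDiff` and its methods are hypothetical, introduced only for illustration.

```java
import java.awt.Color;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Illustrative pixel-level comparison of two rendered page images. */
final class PageImageDiff {

    /**
     * Returns the smallest rectangle enclosing every differing pixel,
     * or null when the images are identical. A pixel that lies outside
     * either image (different page sizes) counts as different.
     */
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        int width = Math.max(a.getWidth(), b.getWidth());
        int height = Math.max(a.getHeight(), b.getHeight());
        int minX = width, minY = height, maxX = -1, maxY = -1;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (differs(a, b, x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel was found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /** Returns a copy of the first image with every differing pixel painted red. */
    static BufferedImage highlight(BufferedImage a, BufferedImage b) {
        int width = Math.max(a.getWidth(), b.getWidth());
        int height = Math.max(a.getHeight(), b.getHeight());
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // When differs() is false, (x, y) is inside both images,
                // so reading from the first image is safe.
                int rgb = differs(a, b, x, y) ? Color.RED.getRGB() : a.getRGB(x, y);
                out.setRGB(x, y, rgb);
            }
        }
        return out;
    }

    private static boolean differs(BufferedImage a, BufferedImage b, int x, int y) {
        boolean inA = x < a.getWidth() && y < a.getHeight();
        boolean inB = x < b.getWidth() && y < b.getHeight();
        if (!inA || !inB) {
            return true;
        }
        return a.getRGB(x, y) != b.getRGB(x, y);
    }
}
```

Calling `PageImageDiff.diffBounds(pageA, pageB)` for each page pair would tell you where on the page a change lies, and `highlight` gives an image that can be saved for inspection. An exact pixel match is strict. If anti-aliasing noise turns out to be a problem, a per-channel tolerance could replace the `!=` check.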

For extraction of text using PDFBox: Extracting text from PDF file using pdfbox

PDFBox has classes for detecting the position, font type, and size of text, and maybe other settings (I didn't search deeper); see the links below. You could then extract the text from both PDFs and compare it to check whether the texts are equal. If they are equal, compare their formatting. If something differs, mark it for display in another text, image, or PDF.

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html

http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
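The text-then-format comparison described above might look roughly like the following sketch. It assumes the text of each PDF has already been split into runs carrying a page number, font name, and font size, for instance collected from PDFBox's `TextPosition` objects. The `TextRun` and `TextRunDiff` classes are hypothetical and are not part of PDFBox. Runs are paired by index, which is a simplification: inserted or deleted text would need a proper sequence diff.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one piece of extracted text and its formatting. */
final class TextRun {
    final int page;
    final String text;
    final String fontName;
    final float fontSize;

    TextRun(int page, String text, String fontName, float fontSize) {
        this.page = page;
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
    }
}

/** Compares text first, then formatting when the text is equal. */
final class TextRunDiff {

    static List<String> compare(List<TextRun> left, List<TextRun> right) {
        List<String> report = new ArrayList<>();
        int common = Math.min(left.size(), right.size());
        for (int i = 0; i < common; i++) {
            TextRun l = left.get(i);
            TextRun r = right.get(i);
            if (!l.text.equals(r.text)) {
                report.add("page " + l.page + ": text differs: \""
                        + l.text + "\" vs \"" + r.text + "\"");
            } else if (!l.fontName.equals(r.fontName)
                    || Math.abs(l.fontSize - r.fontSize) > 0.01f) {
                // Same text, different formatting.
                report.add("page " + l.page + ": format of \"" + l.text + "\" differs: "
                        + l.fontName + " " + l.fontSize + " vs "
                        + r.fontName + " " + r.fontSize);
            }
        }
        if (left.size() != right.size()) {
            report.add("run count differs: " + left.size() + " vs " + right.size());
        }
        return report;
    }
}
```

For the case in the question, two runs with the text "The text" on page 6 but different font sizes would produce a single "format differs" entry for page 6.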
