How to compare two PDFs based on visual differences programmatically?


Question

I need to compare two PDF files and find all the visual differences between them. I know there are some related questions on Stack Overflow, but they do not meet my needs.

I'm currently using PDFBox to render each PDF page to an image and comparing the bytes of the images.

With this approach I can tell that a particular page differs.

But I need finer details, such as the font size of some text: for example, that "The text" differs on a given page, say page 6, of the PDFs.

This applies not only to text: I need to catch all visual differences, such as images, text in charts, etc.

Please suggest a way to achieve this.

PS: I tried Apache Tika, but my impression is that it is meant for getting structured text as XHTML plus metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I have this wrong.

Recommended answer

PDF to image using Java

Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)

Good library to convert PDF to TIFF?

Convert jpeg/png to pixel array in Java

Convert pixel array to bmp in Java

Find pixel location

Get pixel color around an image
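Once both pages are available as images, the pixel-level comparison these topics point toward can be sketched roughly as follows. This is only an illustration, not something from the answer itself: the class `PageImageDiff` and the method `diffBounds` are hypothetical names, and the code assumes both pages were already rendered to `BufferedImage` objects of identical size (for example with PDFBox, at the same resolution). It uses only the standard `java.awt` classes.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of PDFBox or any library.
class PageImageDiff {

    /**
     * Returns the bounding box of all pixels that differ between two
     * page renderings, or null if every pixel matches.
     * Assumes both images have the same width and height.
     */
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Images must have the same size");
        }
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB returns the pixel as a packed ARGB int
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    public static void main(String[] args) {
        // Two synthetic "pages" standing in for rendered PDF pages.
        BufferedImage a = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        b.setRGB(10, 20, 0xFF0000);
        b.setRGB(30, 40, 0x00FF00);
        System.out.println(diffBounds(a, b));
    }
}
```

A single bounding box only says where on the page to look. Splitting the page into a grid of tiles and running the same check per tile would localize several separate differences, at the cost of a little more bookkeeping.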

For extracting text using PDFBox: Extracting text from PDF file using pdfbox

PDFBox has classes for detecting the position, font type, and font size of text, and possibly other settings (I didn't search deeper); links are below. You could then extract the text from both PDFs, compare them to check whether the texts are equal, and, if they are, compare their formatting. If something differs, mark it for display in another text, image, or PDF.

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html

http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
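The "compare the text, then compare the format" idea could look something like the sketch below. It is an assumption-laden illustration rather than PDFBox code: `TextRun`, `TextFormatDiff`, and `compare` are hypothetical names, and the sketch assumes the text of each PDF has already been extracted into an ordered list of runs carrying the font name and size (for instance from the values PDFBox exposes through `TextPosition`). How the runs are produced is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
class TextFormatDiff {

    /** Hypothetical value type: one run of text with its font attributes. */
    static final class TextRun {
        final String text;
        final String fontName;
        final float fontSize;

        TextRun(String text, String fontName, float fontSize) {
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /**
     * Compares two run lists position by position. The text is checked
     * first; font name and size are compared only when the text matches.
     */
    static List<String> compare(List<TextRun> left, List<TextRun> right) {
        List<String> diffs = new ArrayList<>();
        int common = Math.min(left.size(), right.size());
        for (int i = 0; i < common; i++) {
            TextRun l = left.get(i);
            TextRun r = right.get(i);
            if (!l.text.equals(r.text)) {
                diffs.add("run " + i + ": text \"" + l.text + "\" vs \"" + r.text + "\"");
                continue;
            }
            if (!l.fontName.equals(r.fontName)) {
                diffs.add("run " + i + ": font " + l.fontName + " vs " + r.fontName);
            }
            if (Float.compare(l.fontSize, r.fontSize) != 0) {
                diffs.add("run " + i + ": size " + l.fontSize + " vs " + r.fontSize);
            }
        }
        if (left.size() != right.size()) {
            diffs.add("run count " + left.size() + " vs " + right.size());
        }
        return diffs;
    }

    public static void main(String[] args) {
        List<TextRun> a = Arrays.asList(new TextRun("The text", "Helvetica", 12f));
        List<TextRun> b = Arrays.asList(new TextRun("The text", "Helvetica", 14f));
        for (String d : compare(a, b)) {
            System.out.println(d);
        }
    }
}
```

Comparing runs by index is the simplest possible alignment. If text can be inserted or removed between the two files, a sequence diff over the run texts would be needed before the formatting of matched runs is compared.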
