How to get the diff of two PDF files using Python?

Question

I need to find the differences between two PDF files. Does anybody know of a Python-related tool that directly gives the diff of two PDFs?

Recommended answer

What do you mean by "difference"? A difference in the text of the PDF, or a layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get, because PDF is a very complicated file format that offers endless formatting capabilities.

If you want a text diff, run a PDF-to-text utility on both PDFs and then use Python's built-in diff library (difflib) to compare the converted texts.

This question deals with PDF-to-text conversion in Python: "Python module for converting PDF to text".
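
The two steps described above (convert to text, then diff) can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes the external pdftotext command-line tool (from Poppler) is installed and on the PATH, and the function names pdf_to_text, text_diff and diff_pdfs are made up for this example. Any other PDF-to-text converter could be swapped in.

```python
import difflib
import subprocess


def pdf_to_text(pdf_path):
    """Extract text with the external `pdftotext` tool (assumed to be installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if the converter fails
    )
    return result.stdout


def text_diff(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def diff_pdfs(old_pdf, new_pdf):
    """Convert both PDFs to text, then diff the results."""
    return text_diff(pdf_to_text(old_pdf), pdf_to_text(new_pdf), old_pdf, new_pdf)
```

Calling diff_pdfs("a.pdf", "b.pdf") (hypothetical file names) would return the diff lines, which can be printed with "\n".join(...). An empty list means the extracted texts are identical, which does not by itself prove that the layouts are.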

The reliability of this method depends on the PDF generators you are using. If you use, e.g., Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the same Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document in a PDF, and each converter uses a different approach. In addition, the PDF-to-text converter often can't figure out the correct text flow, especially with complex layouts or tables.
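
One way to reduce (not eliminate) such spurious differences is to ignore whitespace and line breaks and compare the texts word by word. The sketch below is an assumption about what may help, not part of the original answer; the function name word_changes is hypothetical. It will not help when the converter emits the words in a different order, as can happen with tables or multi-column layouts.

```python
import difflib


def word_changes(old_text, new_text):
    """List word-level changes, ignoring differences in whitespace and line breaks."""
    # str.split() with no arguments splits on any run of whitespace.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete", "insert"
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes
```

With this approach, two extractions that differ only in where lines wrap produce no changes, while a real edit such as a replaced word still shows up.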
