Copy PDF to a new PDF, but without certain bits of the document

Problem description

I'm trying to do something that I know isn't 100% reliable, but I've read about it and it is my understanding that the only problem I'm facing with trying to remove certain bits of text from a PDF file is that I can't replace them.

What I'm trying to do is take the contents of a PDF file, then copy that content over to another PDF file, but without the text matched by a regular expression. I can already find the matches in my PDF file, and that part works.

However, I can't figure out a way to remove them. Is there a way to say something like

// Remove all TextPosition objects that are within this list

Because I have gathered them, and I can't see why this shouldn't work.

Or is there a way to override what gets written to the new file, and then have that overridden method skip all the text positions that I tell it to skip? I've seen examples of this, but none seem to work when I try them out. (In fact, a lot of the overridden methods don't even seem to be called at all.)

Solution

I can't see why this shouldn't work

One reason why that is at least hard is that in the PDF there are no TextPosition objects.

In the PDF you find instructions drawing strings in some arbitrary encoding. The PDFBox parsing mechanism splits these strings into individual characters, determines their positions, etc., and builds TextPosition objects from them. Unfortunately it does not add a reference back to the original string and the character position therein.

Thus, for code to be able to recognize the matching string parts in the PDF, it has to do all the parsing again and compare before copying.

Thus, to implement your objective you had better not only work with the TextPosition objects but also somehow link them back to the string they come from to start with.

This is somewhat beyond the scope of a Stack Overflow answer, but as this is the (or at least one) focus of your BA work, a decent attempt may fit that scope.

Thus, I'll give some pointers here to give you an idea how to get started.

Why is there no such mechanism in PDFBox to start with?

Actually there once was an example for editing text content of PDF documents in the PDFBox distribution (before version 2). It became more and more obvious, though, that this example relied on a number of preconditions, and as documents not fulfilling those preconditions became more and more common, the example was removed; cf. the PDFBox 2.0.0 migration guide.

You can find a more detailed description of the hindrances to easy text replacement in this answer, the quintessence of which is that generic text replacement is somewhere between complicated and impossible; if you can require certain preconditions in the original PDF, though, the more you can require, the easier it becomes.

In real life, though, you can only require such preconditions if you have a certain level of control over the input, e.g. if you only process outputs of certain other programs and know that those other programs fulfill those requirements.

Consequently PDFBox, being a general purpose library, removed the simple example.

An approach

For a more generic approach to text editing, you should indeed try a combination of text removal and text addition.

For text removal you should consider using something like the generic content stream editor class PdfContentStreamEditor discussed in this answer. As you want to use high-level PDFBox classes representing the text (like TextPosition), though, you probably want to base it on the PDFTextStripper (which uses these text position objects) instead of PDFGraphicsStreamEngine.

In that specialized text stripper / content editor, you'd collect all instructions being parsed instead of immediately writing them out again in write. Additionally you'd associate TextPosition objects retrieved by processTextPosition to the current text drawing instruction retrieved by write to later know which TextPosition belongs to which position of which text drawing instruction.
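
The bookkeeping described here could look roughly like the following. This is only a sketch in plain Java, not PDFBox code: GlyphTracker, GlyphRef, addGlyph and endInstruction are hypothetical names, instructions are represented as plain strings instead of operators with COS operands, and it assumes that the glyph callbacks for an instruction arrive before the editor's write call for that instruction.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bookkeeping for an editing text stripper; not part of PDFBox. */
final class GlyphTracker {
    /** One glyph plus the instruction it came from and its index within that instruction. */
    static final class GlyphRef {
        final String unicode;
        final int instructionIndex;
        final int offset; // counts glyphs, not bytes of the encoded string

        GlyphRef(String unicode, int instructionIndex, int offset) {
            this.unicode = unicode;
            this.instructionIndex = instructionIndex;
            this.offset = offset;
        }
    }

    private final List<String> instructions = new ArrayList<>(); // collected, not yet written
    private final List<GlyphRef> glyphs = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();      // glyphs of the instruction in progress

    /** Would be called from the processTextPosition override. */
    void addGlyph(String unicode) {
        pending.add(unicode);
    }

    /** Would be called from the write override instead of writing immediately. */
    void endInstruction(String instruction) {
        int index = instructions.size();
        instructions.add(instruction);
        for (int i = 0; i < pending.size(); i++) {
            glyphs.add(new GlyphRef(pending.get(i), index, i));
        }
        pending.clear();
    }

    List<String> instructions() { return instructions; }

    List<GlyphRef> glyphs() { return glyphs; }
}
```

In a real implementation the GlyphRef would hold the TextPosition itself, and the offset would have to be translated into a byte range of the encoded string, since a glyph code may take more than one byte.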

When the whole page is parsed, you then can determine the TextPosition objects you want removed.
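
As you select by regular expression, this selection step could be sketched as follows: concatenate the text of all collected glyphs, remember which glyph each character came from, and map every match back to glyph indices. GlyphMatcher and glyphsToRemove are made-up names, and the sketch ignores that a text stripper also emits spaces and line breaks that are not backed by any glyph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps regex matches on the page text back to glyph indices. */
final class GlyphMatcher {
    /**
     * @param glyphTexts the Unicode text of each glyph in reading order
     *                   (a ligature glyph may contribute several characters)
     * @param pattern    the expression whose matches are to be removed
     * @return indices of all glyphs that are at least partially covered by a match
     */
    static SortedSet<Integer> glyphsToRemove(List<String> glyphTexts, Pattern pattern) {
        StringBuilder text = new StringBuilder();
        List<Integer> owner = new ArrayList<>(); // owner.get(c) = glyph index of character c
        for (int g = 0; g < glyphTexts.size(); g++) {
            String s = glyphTexts.get(g);
            text.append(s);
            for (int i = 0; i < s.length(); i++) {
                owner.add(g);
            }
        }
        SortedSet<Integer> result = new TreeSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            for (int c = m.start(); c < m.end(); c++) {
                result.add(owner.get(c));
            }
        }
        return result;
    }
}
```

Note that a match covering only part of a ligature glyph still selects the whole glyph here; whether that is acceptable depends on your documents.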

Once they are known, find the associated text drawing instruction and position. Now you can split the text of each drawing instruction to change, drop the parts to remove, and replace them by some position advancement (e.g. using numerical entries in the array argument of a TJ instruction).
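
To illustrate the splitting, here is a rough sketch of how the operand array of a TJ instruction could be rebuilt for one text drawing instruction. TjRewriter and rewrite are hypothetical names; Java strings stand in for the encoded byte strings of the real content stream, and the sketch assumes horizontal writing with character spacing and word spacing both zero, so that a removed glyph can be compensated by exactly its width. Numbers in a TJ array are given in thousandths of a text space unit and are subtracted from the current position, hence the negative sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: rebuilds the operands of a TJ instruction without some glyphs. */
final class TjRewriter {
    /**
     * @param glyphs the glyphs of one text drawing instruction, in order
     * @param widths the width of each glyph in thousandths of a text space unit
     * @param remove indices of the glyphs to drop
     * @return TJ array elements: String for kept runs, Double for position adjustments
     */
    static List<Object> rewrite(List<String> glyphs, List<Double> widths, Set<Integer> remove) {
        List<Object> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        double skipped = 0;
        for (int i = 0; i < glyphs.size(); i++) {
            if (remove.contains(i)) {
                if (run.length() > 0) {      // close the kept run before the gap
                    out.add(run.toString());
                    run.setLength(0);
                }
                skipped += widths.get(i);
            } else {
                if (skipped != 0) {          // advance over the removed glyphs
                    out.add(-skipped);
                    skipped = 0;
                }
                run.append(glyphs.get(i));
            }
        }
        if (run.length() > 0) {
            out.add(run.toString());
        }
        if (skipped != 0) {
            // keep the trailing advance; later instructions may rely on the text position
            out.add(-skipped);
        }
        return out;
    }
}
```

For glyphs A, B, C, D with widths 600, 500, 700, 400 and B and C removed, this yields the elements "A", -1200.0, "D", i.e. something like [(A) -1200 (D)] TJ in the content stream. With non-zero character or word spacing, or horizontal scaling other than 100%, the adjustment has to take those text state parameters into account as well.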

Once all text drawing instructions related to text positions to delete are so manipulated, you can finally write all the instructions to the editor output.

Thereafter you can add new text as usual at the positions in question.

At least this is how I would approach the task of a more generic text editor. There still are some challenges; e.g. the content stream editor just edits a single content stream while text of a page may be spread over the page content streams and referenced XObject content streams (and actually also pattern content streams).

Depending on the amount of work you are expected to invest in the PDF editing task you may or may not have to look into these challenges.

Documentation

In a comment you remark that you can't find a lot of documentation anywhere. The obvious documentation to use is the PDF specification, ISO 32000-1 and ISO 32000-2. If your department does in-depth PDF tasks a lot, they should have them available for you. If they don't, you can find a copy of ISO 32000-1 with the ISO headers removed published by Adobe on their web site; simply search for 'PDF32000'.

The specification obviously does not document how to replace text, but it documents what the content streams look like and which instructions there may be in them.
