iText or iTextSharp rudimentary text edit

Question

I can extract text from the pages of a PDF in many ways:

String pageText = PdfTextExtractor.GetTextFromPage(reader, i);

This can be used to get any text on a page.

Or:

byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);

The possibilities are endless.

Now I want to remove/redact a certain word, e.g. explicit words or sensitive information (putting black boxes over them is obviously a bad idea :)), or anything else, from the PDF (which is simple and text only). I can find that word just fine using the approach above. I can count its occurrences, etc.

I do not care about layout, or about the fact that PDF is not really meant to be manipulated in this way.

I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...

Recommended answer

If you want to change the content of a page, it isn't sufficient to change the content stream of that page. A page may contain references to Form XObjects that contain content that you want to remove.

A secondary problem is images. For instance: suppose that your document is a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text; you'll also need to manipulate the (pixel) text in the image.

Assuming that your secondary problem doesn't exist, you'll need a two-step approach:

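  1. Get the content of each page as text to detect which pages contain the names or words you want to remove.
  2. Recursively loop over all the content streams to find that text, and rewrite those content streams without it.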

From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:

BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET

Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
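One such heuristic can be sketched in plain Java, without iText: track the x position set by each Td operator, sort the text fragments by that position, and insert a space wherever the gap between two fragments exceeds half a space width. The class name TextRunSorter, the regular expression, and the width table are all hypothetical; the widths assume that /F1 is standard Helvetica, which the snippet itself does not state. The sketch also assumes all fragments sit on one line and use only simple (…) Tj operands.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextRunSorter {
    // Glyph widths in 1/1000 text-space units. ASSUMPTION: /F1 is Helvetica.
    static final Map<Character, Integer> WIDTHS = Map.of(
        'H', 722, 'W', 944, 'e', 556, 'o', 556, 'd', 556, 'l', 222, 'r', 333, ' ', 278);

    // One text fragment and the x position at which it starts.
    static final class Run {
        final double x;
        final String text;
        Run(double x, String text) { this.x = x; this.text = text; }
    }

    // Group 1: font size (Tf). Groups 2 and 3: offsets (Td). Group 4: string operand (Tj).
    static final Pattern OP = Pattern.compile(
        "/\\w+\\s+([\\d.]+)\\s+Tf|(-?[\\d.]+)\\s+(-?[\\d.]+)\\s+Td|\\(([^)]*)\\)\\s*Tj");

    static String recover(String content) {
        double fontSize = 0;
        double x = 0;
        List<Run> runs = new ArrayList<>();
        Matcher m = OP.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                fontSize = Double.parseDouble(m.group(1));
            } else if (m.group(2) != null) {
                // Td offsets are relative to the start of the current line, so they accumulate.
                x += Double.parseDouble(m.group(2));
            } else {
                runs.add(new Run(x, m.group(4)));
            }
        }
        // Reading order on a single line is left to right.
        runs.sort(Comparator.comparingDouble(r -> r.x));
        double spaceWidth = WIDTHS.get(' ') * fontSize / 1000.0;
        StringBuilder sb = new StringBuilder();
        double previousEnd = Double.NaN;
        for (Run r : runs) {
            // A gap wider than half a space is treated as a word break.
            if (!Double.isNaN(previousEnd) && r.x - previousEnd > spaceWidth / 2) {
                sb.append(' ');
            }
            sb.append(r.text);
            previousEnd = r.x + width(r.text, fontSize);
        }
        return sb.toString();
    }

    static double width(String s, double fontSize) {
        int units = 0;
        for (char c : s.toCharArray()) {
            units += WIDTHS.getOrDefault(c, 500); // 500 is an arbitrary fallback width
        }
        return units * fontSize / 1000.0;
    }
}
```

Calling TextRunSorter.recover on the snippet above returns "Hello World". Real documents need far more than this: other text operators, text matrices, multiple lines, font encodings, and real font metrics.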

Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.
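As a rough illustration of what rewriting can look like at the string level, the following plain-Java sketch blanks the operand of every Tj operator while leaving the positioning operators in place. It is not how the OCG remover works; ShowTextBlanker and blankShowText are hypothetical names, and the regular expression is an assumption that only covers literal strings with backslash escapes (no nested parentheses, hex strings, or TJ arrays).

```java
import java.util.regex.Pattern;

class ShowTextBlanker {
    // A literal string operand followed by the Tj operator.
    // Handles backslash escapes such as \( and \), nothing more.
    static final Pattern SHOW_TEXT =
        Pattern.compile("\\((?:\\\\.|[^\\\\()])*\\)\\s*Tj\\b");

    // Replaces every "(text) Tj" with "() Tj", so no glyphs are shown
    // but the rest of the content stream syntax stays valid.
    static String blankShowText(String content) {
        return SHOW_TEXT.matcher(content).replaceAll("() Tj");
    }
}
```

Applied to the snippet above, this leaves the BT, Tf, Td, and ET lines untouched and turns each of the four Tj lines into "() Tj". A real redaction would blank only the fragments that belong to the detected word.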

Long story short: if your PDFs are relatively simple, that is, if the text can be easily detected in the different content streams (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.

I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.

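// Imports assume iText 5 (package com.itextpdf.text).
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.*;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    // Only the content of page 1 is examined.
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream = (PRStream) object;
        byte[] data = PdfReader.getStreamBytes(stream);
        // ISO-8859-1 maps every byte to one char, so untouched bytes survive the round trip.
        String content = new String(data, StandardCharsets.ISO_8859_1);
        stream.setData(content.replace("Hello World", "HELLO WORLD").getBytes(StandardCharsets.ISO_8859_1));
    }
    // Write the modified document to dest.
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}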
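One detail matters whenever a content stream is edited as a string: the charset used to decode and re-encode the bytes. A content stream can contain bytes above 0x7F (for example in strings that use a font-specific encoding), and a multi-byte charset such as UTF-8 will not reproduce them. The following plain-Java sketch shows the difference; StreamCharsetDemo and replaceVia are hypothetical names used only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StreamCharsetDemo {
    // Decodes the bytes with the given charset, performs the replacement,
    // and encodes the result with the same charset.
    static byte[] replaceVia(byte[] data, Charset charset) {
        String content = new String(data, charset);
        return content.replace("Hello World", "HELLO WORLD").getBytes(charset);
    }

    // Builds a small fake stream: ASCII text followed by two non-ASCII bytes.
    static byte[] sample(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] data = Arrays.copyOf(ascii, ascii.length + 2);
        data[ascii.length] = (byte) 0xE9;
        data[ascii.length + 1] = (byte) 0xFF;
        return data;
    }
}
```

With ISO-8859-1, replaceVia(sample("(Hello World) Tj\n"), StandardCharsets.ISO_8859_1) yields exactly the bytes of sample("(HELLO WORLD) Tj\n"). With UTF-8, the two trailing bytes are not valid input and come back altered.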

Some caveats:

  • I check whether object is a stream. It could also be an array of streams; in that case, you need to loop over that array.
  • I don't check whether Form XObjects are defined for the page.
  • I assume that "Hello World" can be easily detected in the PDF syntax.
  • ...

In real life, PDFs are never that simple, and the complexity of your project will increase dramatically with every special feature that is used in your documents.
