iText or iTextSharp rudimentary text edit

Question

I can extract text from pages in a PDF in many ways:

String pageText = PdfTextExtractor.GetTextFromPage(reader, i);

This can be used to get any text on a page.

Or:

byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);

The possibilities are endless.

Now I want to remove/redact a certain word, e.g. explicit words, sensitive information (putting black boxes over them obviously is a bad idea :) or whatever from the PDF (which is simple and text only). I can find that word just fine using the approach above. I can count its occurrences etc...

I do not care about layout, or the fact that PDF is not really meant to be manipulated in this way.

I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...

Recommended answer

If you want to change the content of a page, it isn't sufficient to change the content stream of a page. A page may contain references to Form XObjects that contain content that you want to remove.

A secondary problem consists of images. For instance: suppose that your document consists of a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text, you'll also need to manipulate the (pixel) text in the image.

Assuming that your secondary problem doesn't exist, you'll need a double approach:


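  1. Get the content of each page as text to detect which pages contain the name or word you want to remove.

  2. Loop recursively over all the content streams to find that text, and rewrite those content streams without it.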

From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:

BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET

Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
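As a rough illustration of such a heuristic, the sketch below reads the `Td` and `Tj` operators of a text object like the one above, tracks the line-start position that each `Td` sets, and sorts the shown strings by position. It uses only the Java standard library, not iText, and the class and method names (`TextChunkSketch`, `readingOrder`) are made up for this example. It assumes literal strings without escaped parentheses, and it ignores `TJ` arrays, hex strings, `Tm`, and font encodings. It also cannot recover the space between the two words, because that would require font metrics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of iText.
class TextChunkSketch {

    // The stream snippet quoted in the answer.
    static final String SNIPPET =
        "BT\n/F1 12 Tf\n88.66 367 Td\n(ld) Tj\n-22 0 Td\n(Wor) Tj\n"
        + "-15.33 0 Td\n(llo) Tj\n-15.33 0 Td\n(He) Tj\nET";

    // A string shown by Tj, with the line-start position set by the preceding Td.
    static final class Chunk {
        final double x, y;
        final String text;
        Chunk(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    // Matches either "<tx> <ty> Td" or "(<literal string>) Tj".
    private static final Pattern OPS = Pattern.compile(
        "(-?\\d*\\.?\\d+)\\s+(-?\\d*\\.?\\d+)\\s+Td|\\(([^()]*)\\)\\s*Tj");

    static List<Chunk> chunks(String content) {
        List<Chunk> result = new ArrayList<>();
        double x = 0, y = 0; // Td offsets are relative to the start of the current line
        Matcher m = OPS.matcher(content);
        while (m.find()) {
            if (m.group(3) != null) {
                result.add(new Chunk(x, y, m.group(3)));
            } else {
                x += Double.parseDouble(m.group(1));
                y += Double.parseDouble(m.group(2));
            }
        }
        return result;
    }

    // Concatenates the chunks top to bottom, then left to right.
    static String readingOrder(String content) {
        List<Chunk> sorted = chunks(content);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                              .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) sb.append(c.text);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(readingOrder(SNIPPET)); // prints HelloWorld (no space)
    }
}
```

Comparing the result against the search term with whitespace stripped would tell you that this text object contains the phrase, and the chunk list tells you which `Tj` operands would have to be rewritten.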

Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.

Long story short: if your PDFs are relatively simple, that is, the text can be easily detected in the different content streams (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.
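One detail worth getting right in those string manipulations is the byte-to-string round trip. The sketch below, which uses only the Java standard library, decodes the decompressed stream bytes as ISO-8859-1 so that every byte maps to exactly one character and back again; a platform default charset such as UTF-8 could alter bytes outside the ASCII range. The names `StreamBytesSketch` and `replaceInStream` are hypothetical, and the approach assumes that the search text appears literally in a single string operand and that the replacement only uses characters available in the font's encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of iText.
class StreamBytesSketch {

    // Replaces every literal occurrence of `search` in the stream bytes,
    // leaving all other bytes (including values 0x80..0xFF) untouched.
    static byte[] replaceInStream(byte[] data, String search, String replacement) {
        String content = new String(data, StandardCharsets.ISO_8859_1);
        return content.replace(search, replacement)
                      .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] in = "BT (Hello World) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = replaceInStream(in, "Hello World", "HELLO WORLD");
        System.out.println(new String(out, StandardCharsets.ISO_8859_1));
    }
}
```

With iText, the input would be the result of `PdfReader.getStreamBytes(stream)` and the output would be passed to `stream.setData(...)`.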

I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.

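// Imports assume iText 5 package names.
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.*;

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    // Only handles a single content stream; /Contents can also be an array of streams.
    if (object instanceof PRStream) {
        PRStream stream = (PRStream)object;
        byte[] data = PdfReader.getStreamBytes(stream);
        // ISO-8859-1 maps bytes to chars one-to-one, so no other byte is altered.
        String content = new String(data, StandardCharsets.ISO_8859_1);
        stream.setData(content.replace("Hello World", "HELLO WORLD")
            .getBytes(StandardCharsets.ISO_8859_1));
    }
    // Write the modified document to dest.
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}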

Some caveats:


  • I check if object is a stream. It could also be an array of streams. In that case, you need to loop over that array.
  • I don't check if there are form XObjects defined for the page.
  • I assume that Hello World can be easily detected in the PDF Syntax.
  • ...

In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.
