Replace text inside a PDF file using iText

Question

I'm using the iText (5.5.13) library to read a PDF and replace a pattern inside the file. The problem is that the pattern is not found, because some weird characters appear when the library reads the PDF.

For example, the sentence:

"This is a test in order to see if the"

becomes this when I try to read it:

[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]

So if I try to find and replace "test", no "test" word is found in the PDF file, and nothing is replaced.

This is the code I am using:

public void processPDF(String src, String dest) {

    try {

      PdfReader reader = new PdfReader(src);
      PdfArray refs = null;
      PRIndirectReference reference = null;

      int nPages = reader.getNumberOfPages();

      for (int i = 1; i <= nPages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object.isArray()) {
          refs = dict.getAsArray(PdfName.CONTENTS);
          ArrayList<PdfObject> references = refs.getArrayList();

          for (PdfObject r : references) {

            reference = (PRIndirectReference) r;
            PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, "UTF-8");

            dd = dd.replaceAll("@pattern_1234", "trueValue");
            dd = dd.replaceAll("test", "tested");

            stream.setData(dd.getBytes());
          }

        }
        if (object instanceof PRStream) {
          PRStream stream = (PRStream) object;

          byte[] data = PdfReader.getStreamBytes(stream);
          String dd = new String(data, "UTF-8");
          System.out.println("content---->" + dd);
          dd = dd.replaceAll("@pattern_1234", "trueValue");
          dd = dd.replaceAll("This", "FIRST");

          stream.setData(dd.getBytes(StandardCharsets.UTF_8));
        }
      }
      PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
      stamper.close();
      reader.close();
    }

    catch (Exception e) {
    }
  }

Recommended answer

As has already been mentioned in comments and answers, PDF is not a format meant for text editing. It is a final format, and information on the flow of text, its layout, and even its mapping to Unicode is optional.

Thus, even assuming the optional information on mapping glyphs to Unicode is present, the approach to this task with iText might look a bit unsatisfying: first one would determine the position of the text in question using a custom text extraction strategy, then remove the current contents of everything at that position using the PdfCleanUpProcessor, and finally draw the replacement text into the gap.
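To see why a plain string replacement on the raw stream misses the word, note that the bracketed line quoted in the question is the operand of a TJ text-showing operator: string pieces in parentheses interleaved with numeric position adjustments (kerning). The following is only a rough sketch in plain Java, not part of iText; the class and method names are made up for illustration. It assumes a simple single-byte encoding in which the string bytes happen to be readable as ASCII, ignores hex strings, and treats escape sequences in a simplified way.

```java
public class TjArrayText {

    /**
     * Concatenates the literal strings of a TJ array operand and
     * skips the numeric adjustments between them.
     */
    public static String extractText(String tjArray) {
        StringBuilder text = new StringBuilder();
        int depth = 0; // nesting depth of parentheses inside a string
        for (int i = 0; i < tjArray.length(); i++) {
            char c = tjArray.charAt(i);
            if (depth == 0) {
                if (c == '(')
                    depth = 1; // a string piece starts
                continue;      // brackets, numbers, whitespace are skipped
            }
            if (c == '\\' && i + 1 < tjArray.length()) {
                // simplification: keep the escaped character as is
                text.append(tjArray.charAt(++i));
            } else if (c == '(') {
                depth++;
                text.append(c);
            } else if (c == ')') {
                if (--depth > 0)
                    text.append(c);
            } else {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String operand = "[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]";
        // The raw operand never contains the contiguous word ...
        System.out.println(operand.contains("test"));
        // ... but the concatenated string pieces do.
        System.out.println(extractText(operand));
    }
}
```

The word only becomes visible once the pieces are joined, which is why matching has to happen on glyph level rather than on the raw bytes of the stream.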

In this answer I present a helper class that combines the first two steps, finding and removing the existing text, with the advantage that only the text is removed, not also any background graphics etc. as happens with PdfCleanUpProcessor redaction. The helper furthermore returns the positions of the removed text, which allows stamping the replacement onto them.

The helper class is based on the PdfContentStreamEditor presented in this earlier answer. Please use the version of this class on GitHub, though, as the original class has been enhanced a bit since its conception.

The SimpleTextRemover helper class illustrates what is necessary to properly remove text from a PDF. Actually it is limited in a few aspects:

  • It only replaces text in the actual page content streams.

    To also replace text in embedded XObjects, one has to recursively iterate through the XObject resources of the page in question and apply the editor to them as well.

  • It is "simple" in the same way the SimpleTextExtractionStrategy is: it assumes that the text showing instructions appear in the content in reading order.

    To also work with content streams in which the order is different, the instructions must be sorted. This implies that all incoming instructions and the relevant render information must be cached until the end of the page, not merely a few instructions at a time. Then the render information can be sorted, the sections to remove can be identified in the sorted render information, the associated instructions can be manipulated, and the instructions can eventually be stored.

  • It does not try to identify gaps between glyphs that visually represent a white space while there actually is no glyph at all.

    To identify gaps, the code must be extended to check whether two consecutive glyphs exactly follow one another or whether there is a gap or a line jump.

  • When calculating the gap to leave where a glyph is removed, it does not yet take the character and word spacing into account.

    To fix this, the glyph width calculation must be improved.
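As a rough idea of what an improved width calculation for the last limitation could look like: according to the PDF text model, a glyph advances the text position by its width scaled to the font size plus the character spacing (Tc) and, for the single-byte space code 32, the word spacing (Tw). A number n in a TJ array moves the position by -n/1000 of the font size, so the number that replaces a removed glyph has to cover all of that. The sketch below is plain Java with hypothetical names; it is not code from the helper class and assumes horizontal writing mode. Horizontal scaling applies to both sides equally and therefore drops out.

```java
public class GlyphAdvance {

    /**
     * Computes the number to put into a TJ array in place of a removed glyph.
     *
     * @param glyphWidth   glyph width in thousandths of text space units
     * @param fontSize     current font size (Tf operand), must not be 0
     * @param charSpacing  current character spacing (Tc)
     * @param wordSpacing  current word spacing (Tw)
     * @param isSpace      whether the glyph is the single-byte code 32
     */
    public static double tjAdjustment(double glyphWidth, double fontSize,
            double charSpacing, double wordSpacing, boolean isSpace) {
        // Spacing is given in unscaled text space units, so convert it
        // to thousandths of the font size before adding it to the width.
        double spacing = charSpacing + (isSpace ? wordSpacing : 0.0);
        return -(glyphWidth + 1000.0 * spacing / fontSize);
    }

    public static void main(String[] args) {
        // Without spacing this is what the helper class already does.
        System.out.println(tjAdjustment(500, 10, 0, 0, false));
        // With Tc = 1 at font size 10 the gap has to be wider.
        System.out.println(tjAdjustment(500, 10, 1, 0, false));
    }
}
```

With zero character and word spacing the result equals the plain negated glyph width, so the existing behaviour is the special case of this formula.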

Considering the example excerpt from your content stream, though, these restrictions probably won't hinder you.
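Should the missing gap detection matter for other documents, the check for whether two consecutive glyphs follow one another could be sketched roughly as below. This is plain Java with made-up names and is not part of the helper class. It assumes horizontal, unrotated text, takes the end point of the previous glyph's baseline and the start point of the next one, and uses half the width of a space as tolerance, which is a heuristic and not a value prescribed by any specification.

```java
public class GlyphGap {

    public enum Kind { ADJACENT, GAP, JUMP }

    /**
     * Classifies the transition between two consecutive glyphs.
     *
     * @param prevEndX   x of the end point of the previous glyph's baseline
     * @param prevEndY   y of the end point of the previous glyph's baseline
     * @param nextStartX x of the start point of the next glyph's baseline
     * @param nextStartY y of the start point of the next glyph's baseline
     * @param spaceWidth width of a space in the current font and size
     */
    public static Kind classify(double prevEndX, double prevEndY,
            double nextStartX, double nextStartY, double spaceWidth) {
        double tolerance = spaceWidth / 2;
        // A different baseline height is treated as a line jump.
        if (Math.abs(nextStartY - prevEndY) > tolerance)
            return Kind.JUMP;
        double dx = nextStartX - prevEndX;
        // Moving clearly forward on the same baseline looks like a space.
        if (dx > tolerance)
            return Kind.GAP;
        // Moving clearly backward is no continuation of the same run.
        if (dx < -tolerance)
            return Kind.JUMP;
        return Kind.ADJACENT;
    }
}
```

A pending match would then only be continued across ADJACENT transitions, while a GAP could be matched against a space character in the search string.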

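import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

import com.itextpdf.text.ExceptionConverter;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// PdfContentStreamEditor is the class from the earlier answer mentioned above;
// it must be on the classpath (the same package is assumed here).
public class SimpleTextRemover extends PdfContentStreamEditor {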
    public SimpleTextRemover() {
        super (new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener)getRenderListener()).simpleTextRemover = this;
    }

    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists each of which represents
     * a match and can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.length()  == 0)
            return Collections.emptyList();

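        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        // Reset the per-page counter as well; otherwise it keeps counting
        // across pages while elementNumber restarts at -1.
        processedElements = 0;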
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }

    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));

        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;

        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);

        super.write(processor, operator, operands);
        return true;
    }

    /**
     * Tries to process a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instruction contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // Currently pending glyph intersects parameter of this operation -> cannot yet process
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is not subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate 
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    if (index < glyph.index) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);

        return true;
    }

    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray)textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else 
            return textArgument instanceof PdfString ? 1 : 0;
    }

    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
            // intentional fall-through: the " operator also moves to the next line
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }

    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }

    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();

    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}

(SimpleTextRemover helper class)

You can use it like this:

PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    System.out.printf("Page %d:\n", i);
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size()-1);
        Vector baseEnd = last.base.getEndPoint();
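        // Vector here is com.itextpdf.text.pdf.parser.Vector; I1 and I2 are its x and y indices
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(Vector.I1), baseStart.get(Vector.I2), baseEnd.get(Vector.I1), baseEnd.get(Vector.I2));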
    }
}

pdfStamper.close();

with the following console output for my test file:

test.pdf - Test
Page 1:
  Match from (134,8 666,9) to (177,8 666,9)
  Match from (134,8 642,0) to (153,4 642,0)
  Match from (172,8 642,0) to (191,4 642,0)

and the occurrences of "Test" are missing at those positions in the output PDF.

Instead of outputting the match coordinates, you can use them to draw replacement text at the positions in question.
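On content stream level, drawing replacement text at a match position boils down to a short text object. The sketch below builds such an object as a string in plain Java so that it is clear what ends up in the page; it is an illustration with made-up names, not code from this answer. It assumes that a font resource named F1 exists in the page resources (a hypothetical name), that the text is plain ASCII in a simple encoding, and that the text is unrotated. With iText 5 one would normally not write these operators by hand but draw onto the PdfContentByte returned by pdfStamper.getOverContent(pageNum).

```java
import java.util.Locale;

public class ReplacementTextOps {

    /** Escapes backslashes and parentheses for use in a PDF literal string. */
    static String escape(String text) {
        return text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)");
    }

    /**
     * Builds a text object that shows the given text with its baseline
     * starting at (x, y), e.g. the start point of the first removed glyph.
     */
    public static String showTextAt(String fontName, double fontSize,
            double x, double y, String text) {
        // Locale.ROOT keeps the decimal point; PDF syntax does not accept
        // decimal commas like those in the console output above.
        return String.format(Locale.ROOT,
                "BT /%s %.2f Tf 1 0 0 1 %.2f %.2f Tm (%s) Tj ET",
                fontName, fontSize, x, y, escape(text));
    }

    public static void main(String[] args) {
        // Coordinates taken from the first match in the console output.
        System.out.println(showTextAt("F1", 12, 134.8, 666.9, "tested"));
    }
}
```

If the replacement is wider than the removed text, it will overlap whatever follows on the line, since nothing in the page is reflowed.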
