iText PDFSweep RegexBasedCleanupStrategy does not work in some cases

Question

I'm trying to use iText PDFSweep's RegexBasedCleanupStrategy to redact some words from a PDF. However, I only want to redact a word when it stands alone, not when it appears inside another word. For example, I want to redact "al" as a single word, but not the "al" in "mineral". So I added word boundaries ("\b") to the regex passed to RegexBasedCleanupStrategy:

  new RegexBasedCleanupStrategy("\\bal\\b")

However, pdfAutoSweep.cleanUp does not redact the word if it is at the end of a line.

Recommended answer

In short

The cause of this issue is that the routine which flattens the extracted text chunks into a single String, to which the regular expression is then applied, does not insert any indicator for a line break. Thus, in that String the last letter of one line is immediately followed by the first letter of the next, which hides the word boundary. One can fix the behavior by adding an appropriate character to the String whenever a line break occurs.
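The effect can be reproduced with plain java.util.regex, without any PDF involved. The following is only a minimal sketch: the class WordBoundaryDemo, the method found, and the sample lines are made up for illustration, and joining whole lines is a simplification of what the library does with individual text chunks.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Joins the lines with the given separator and reports whether the regex
    // finds a match anywhere in the resulting single String.
    static boolean found(String regex, List<String> lines, String separator) {
        String flattened = String.join(separator, lines);
        return Pattern.compile(regex).matcher(flattened).find();
    }

    public static void main(String[] args) {
        // "al" is the last word of the first line.
        List<String> lines = Arrays.asList("the word al", "starts the next line");

        // Lines glued together: "al" runs into "starts", so \b after "al" fails.
        System.out.println(found("\\bal\\b", lines, ""));

        // A newline between the lines restores the word boundary.
        System.out.println(found("\\bal\\b", lines, "\n"));
    }
}
```

With an empty separator the flattened text contains "alstarts", so the trailing \b cannot match; with a newline (or a space) between the lines it can.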

The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. For a merely horizontal gap, this routine inserts a space character. For a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is built:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

A possible fix

One can extend the code above to insert a newline character in case of a line break:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, making it preserve word boundaries at line breaks should not break anything and should indeed be considered a fix.

One might instead consider adding a different character for a line break, e.g. a plain space ' ', if vertical gaps should not be treated any differently from horizontal ones. For a general fix, one might therefore consider making this character a settable property of the strategy.
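The idea of a settable line break character could look roughly like the sketch below. It is not iText code: LineAwareFlattener, its nested Chunk type (text plus a line index) and setLineBreakChar are hypothetical names used only for illustration, and the geometry-based checks of the real routine (sameLine, isAtWordBoundary) are replaced by a simple comparison of line indices.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineAwareFlattener {
    // Hypothetical stand-in for a text chunk: its text and the index of its line.
    static final class Chunk {
        final String text;
        final int line;

        Chunk(String text, int line) {
            this.text = text;
            this.line = line;
        }
    }

    // Character appended whenever the next chunk is on a different line.
    private char lineBreakChar = '\n';

    void setLineBreakChar(char lineBreakChar) {
        this.lineBreakChar = lineBreakChar;
    }

    // Flattens the chunks into one String, marking each line change.
    String flatten(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null && chunk.line != last.line) {
                sb.append(lineBreakChar);
            }
            sb.append(chunk.text);
            last = chunk;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("mineral ", 0), new Chunk("al", 0), new Chunk("so", 1));

        LineAwareFlattener flattener = new LineAwareFlattener();
        flattener.setLineBreakChar(' '); // treat line breaks like horizontal gaps

        Matcher m = Pattern.compile("\\bal\\b").matcher(flattener.flatten(chunks));
        while (m.find()) {
            System.out.println("match at index " + m.start());
        }
    }
}
```

In this sketch only the standalone "al" at the end of the first line matches; the "al" inside "mineral" does not, whichever separator character is configured.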

I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.
