Filter out all text above a certain font size from PDF

Problem description

As the title says, I want to filter out all text from a PDF that is above a certain font size. Currently I am using the PDFBox library, but I am open to using any other free library for Java.

My approach was to use a PDFStreamParser to iterate through the tokens: when I pass a Tf operator with a size greater than my threshold, I don't add the next Tj/TJ that is seen. However, it has become clear to me that this relatively simple approach will not work because the text may be scaled by the current transformation matrix.

Is there a better approach I could be taking, or a way to make my approach work without it getting too complicated?

Recommended answer

Your approach

"When I pass a Tf operator that has a size greater than my threshold, don't add the next Tj/TJ that is seen."

is too simple.

On the one hand, as you remark yourself,

"the text may be scaled by the current transformation matrix."

(Actually not only by the current transformation matrix but also by the text matrix!)

Thus, you have to keep track of these matrices.

On the other hand, Tf doesn't only set the base font size for the next text drawing instruction seen; it sets it until the size is explicitly changed by some other instruction.

Furthermore, the text font size and the current transformation matrix are part of the graphics state; thus, they are subject to save state (q) and restore state (Q) instructions.
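To illustrate why Tf has to be tracked together with save and restore state, here is a minimal sketch in plain Java that does not use PDFBox at all. The class FontSizeTracker and its simplified one-string-per-instruction input format are assumptions made up for this illustration; a real content stream needs a proper parser and far more state than a single number.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical toy model: tracks only the base font size across Tf, q and Q. */
class FontSizeTracker {
    private static final List<String> TEXT_SHOWING = Arrays.asList("Tj", "TJ", "'", "\"");

    /**
     * Each instruction is "operands... operator" separated by single spaces,
     * e.g. "/F1 12 Tf" (assumed toy format, operands must not contain spaces).
     * Returns the base font size in effect at each text showing instruction.
     */
    static List<Double> sizesAtTextShowing(String... instructions) {
        Deque<Double> saved = new ArrayDeque<>();
        List<Double> result = new ArrayList<>();
        double fontSize = 0;
        for (String instruction : instructions) {
            String[] tokens = instruction.trim().split(" ");
            String operator = tokens[tokens.length - 1];
            if (operator.equals("Tf")) {
                // The size stays in effect until the next Tf (or a restore)
                fontSize = Double.parseDouble(tokens[tokens.length - 2]);
            } else if (operator.equals("q")) {
                saved.push(fontSize);          // save graphics state
            } else if (operator.equals("Q")) {
                if (!saved.isEmpty()) {
                    fontSize = saved.pop();    // restore graphics state
                }
            } else if (TEXT_SHOWING.contains(operator)) {
                result.add(fontSize);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sizesAtTextShowing(
                "/F1 12 Tf", "(a) Tj", "(b) Tj", "q", "/F1 40 Tf", "(c) Tj", "Q", "(d) Tj"));
    }
}
```

In this toy input the second Tj still uses size 12 although no Tf directly precedes it, and the Tj after Q is back at size 12 although the last Tf seen set size 40. A "skip the next Tj after a large Tf" rule gets both cases wrong.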

To edit a content stream with respect to the current state, therefore, you have to keep track of a lot of information. Fortunately, PDFBox contains classes that do the heavy lifting here: the class hierarchy based on PDFStreamEngine allows you to concentrate on your task. To have as much information as possible available for editing, the PDFGraphicsStreamEngine class appears to be a good choice to build upon.

Thus, let's derive PdfContentStreamEditor from PDFGraphicsStreamEngine and add some code for generating a replacement content stream.

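import java.awt.geom.Point2D;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class PdfContentStreamEditor extends PDFGraphicsStreamEngine {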
    public PdfContentStreamEditor(PDDocument document, PDPage page) {
        super(page);
        this.document = document;
    }

    /**
     * <p>
     * This method retrieves the next operation before its registered
     * listener is called. The default does nothing.
     * </p>
     * <p>
     * Override this method to retrieve state information from before the
     * operation execution.
     * </p> 
     */
    protected void nextOperation(Operator operator, List<COSBase> operands) {
        
    }

    /**
     * <p>
     * This method writes content stream operations to the target canvas. The default
     * implementation writes them as they come, so it essentially generates identical
     * copies of the original instructions {@link #processOperator(Operator, List)}
     * forwards to it.
     * </p>
     * <p>
     * Override this method to achieve some fancy editing effect.
     * </p> 
     */
    protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
        contentStreamWriter.writeTokens(operands);
        contentStreamWriter.writeToken(operator);
    }

    // stub implementation of PDFGraphicsStreamEngine abstract methods
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException { }

    @Override
    public void drawImage(PDImage pdImage) throws IOException { }

    @Override
    public void clip(int windingRule) throws IOException { }

    @Override
    public void moveTo(float x, float y) throws IOException { }

    @Override
    public void lineTo(float x, float y) throws IOException { }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException { }

    @Override
    public Point2D getCurrentPoint() throws IOException { return null; }

    @Override
    public void closePath() throws IOException { }

    @Override
    public void endPath() throws IOException { }

    @Override
    public void strokePath() throws IOException { }

    @Override
    public void fillPath(int windingRule) throws IOException { }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException { }

    @Override
    public void shadingFill(COSName shadingName) throws IOException { }

    // PDFStreamEngine overrides to allow editing
    @Override
    public void processPage(PDPage page) throws IOException {
        PDStream stream = new PDStream(document);
        replacement = new ContentStreamWriter(replacementStream = stream.createOutputStream(COSName.FLATE_DECODE));
        super.processPage(page);
        replacementStream.close();
        page.setContents(stream);
        replacement = null;
        replacementStream = null;
    }

    @Override
    public void showForm(PDFormXObject form) throws IOException {
        // DON'T descend into XObjects
        // super.showForm(form);
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        nextOperation(operator, operands);
        super.processOperator(operator, operands);
        write(replacement, operator, operands);
    }

    final PDDocument document;
    OutputStream replacementStream = null;
    ContentStreamWriter replacement = null;
}

This code overrides processPage to create a new page content stream and eventually replace the old one with it. It also overrides processOperator to hand each processed instruction over for editing.

To edit, one simply overrides write. The existing implementation writes the instructions as they come, while an override may change the instructions that are written. Overriding nextOperation allows you to peek at the graphics state before the current instruction is applied to it.

Applying the editor as is,

PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page);
    identity.processPage(page);
}
document.save(RESULT);

therefore, creates a result PDF with equivalent content streams.

You want to

"filter out all text from a PDF that is above a certain font size."

Thus, we have to check in write whether the current instruction is a text drawing instruction, and if it is, we have to check the current effective font size, i.e. the base font size transformed by the text matrix and the current transformation matrix. If the effective font size is too large, we have to drop the instruction.
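The effective font size calculation can be looked at in isolation. The following is a minimal sketch using only java.awt.geom from the JDK instead of the PDFBox Matrix class; the class name EffectiveFontSizeSketch and the choice to pass matrices as six-element arrays in PDF order [a b c d e f] are assumptions for this illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper: effective font size from base size, text matrix and CTM. */
class EffectiveFontSizeSketch {
    /**
     * tm and ctm hold the six PDF matrix entries [a b c d e f]; the
     * AffineTransform(double[]) constructor expects them in the same order.
     */
    static double effectiveFontSize(double fontSize, double[] tm, double[] ctm) {
        // concatenate() makes the text matrix apply first, then the CTM
        AffineTransform combined = new AffineTransform(ctm);
        combined.concatenate(new AffineTransform(tm));

        // Transform the origin and the tip of the vertical vector (0, fontSize)
        Point2D origin = combined.transform(new Point2D.Double(0, 0), null);
        Point2D tip = combined.transform(new Point2D.Double(0, fontSize), null);

        // Using the distance makes the result independent of translation and rotation
        return origin.distance(tip);
    }

    public static void main(String[] args) {
        double[] tm = {2, 0, 0, 2, 0, 0};            // text matrix scales by 2
        double[] ctm = {1.5, 0, 0, 1.5, 100, 200};   // CTM scales by 1.5 and translates
        System.out.println(effectiveFontSize(12, tm, ctm));
    }
}
```

Measuring the length of the transformed vector, rather than reading a single matrix entry, is what keeps the check meaningful for rotated text, where the scale factor is spread over several matrix entries.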

In the editor, this filtering can be done as follows:

PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor identity = new PdfContentStreamEditor(document, page) {
        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            {
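                // Base font size as set by the most recent Tf instruction
                float fs = getGraphicsState().getTextState().getFontSize();
                // Combine the text matrix with the current transformation matrix
                Matrix matrix = getTextMatrix().multiply(getGraphicsState().getCurrentTransformationMatrix());
                // The length of the transformed vector (0, fs) is the effective font size
                Point2D.Float transformedFsVector = matrix.transformPoint(0, fs);
                Point2D.Float transformedOrigin = matrix.transformPoint(0, 0);
                double transformedFs = transformedFsVector.distance(transformedOrigin);
                // 100 is an example threshold; returning without writing drops the instruction
                if (transformedFs > 100)
                    return;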
            }

            super.write(contentStreamWriter, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    identity.processPage(page);
}
document.save(RESULT);

Strictly speaking, completely dropping the instruction in question may not suffice; instead, one would have to replace it with an instruction that changes the text matrix just like the dropped text drawing instruction would have done. Otherwise the following, non-dropped text may be moved. Often, though, this does work as is because the text matrix is set anew for the following text. So let's keep it simple here.
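One part of such a replacement can be sketched without any font information. The PDF specification defines the ' operator as T* followed by Tj, and aw ac string " as aw Tw, ac Tc, followed by string '. So when one of these two is dropped, the line move and the spacing changes can be kept. The class DroppedTextReplacement below is a hypothetical, string-based illustration and not PDFBox code; it deliberately does not try to reproduce the horizontal advance of the dropped text, which would require glyph widths.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: tokens to write instead of a dropped text showing instruction. */
class DroppedTextReplacement {
    /** Operands and operators are plain strings here purely for illustration. */
    static List<String> replacementFor(String operator, List<String> operands) {
        switch (operator) {
            case "'":
                // string ' == T* string Tj: keep the move to the next line
                return Collections.singletonList("T*");
            case "\"":
                // aw ac string " == aw Tw ac Tc string ': keep spacing and line move
                return Arrays.asList(operands.get(0), "Tw", operands.get(1), "Tc", "T*");
            default:
                // Tj and TJ only advance the text matrix horizontally;
                // that advance is not reconstructed in this sketch
                return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        System.out.println(replacementFor("\"", Arrays.asList("1", "0.5", "(text)")));
    }
}
```

In the anonymous editor subclass, the equivalent step would be to write such replacement tokens instead of simply returning, assuming the operands are available in the order shown.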

This PdfContentStreamEditor only edits the page content stream. That stream may use XObjects and Patterns, which the editor currently does not edit. After editing the page content stream, though, it should be easy to recursively iterate over the XObjects and Patterns and edit them in a similar fashion.

This PdfContentStreamEditor essentially is a port of the PdfContentStreamEditor for iText 5 (.Net/Java) and the PdfCanvasEditor for iText 7, both published in earlier answers. The examples for using those editor classes may give some hints on how to use this PdfContentStreamEditor for PDFBox.

A similar (but less generic) approach has been used previously in the HelloSignManipulator class (testarea/pdfbox2/content/HelloSignManipulator.java#L42) in another answer.

In the context of another question, a bug in the PdfContentStreamEditor was found which caused some text lines in the example PDF discussed there to be moved.

The background: some PDF instructions are defined via other ones, e.g. tx ty TD is specified to have the same effect as -ty TL tx ty Td. For simplicity, the corresponding PDFBox OperatorProcessor implementations work by feeding the equivalent instructions back into the stream engine.

In such a case the PdfContentStreamEditor as implemented above receives signals for both the replacement instructions and the original instruction and writes them all back into the result stream. Thus, the effect of those instructions is doubled. E.g. in the case of the TD instruction, the text insertion point is moved forward two lines instead of one...

Thus, we have to ignore the replacement instructions. To do so, replace the method processOperator above with

@Override
protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
    if (inOperator) {
        super.processOperator(operator, operands);
    } else {
        inOperator = true;
        nextOperation(operator, operands);
        super.processOperator(operator, operands);
        write(replacement, operator, operands);
        inOperator = false;
    }
}

boolean inOperator = false;
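The effect of the guard can be reproduced in a small simulation without PDFBox. The class ReentrantEngineSketch below is made up for this illustration: its execute method stands in for the PDFBox operator processors and feeds TL and Td back into the engine when it sees TD, as described above. The sketch additionally resets the flag in a finally block, which is a variation on the method above and not part of it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of an engine whose operators feed instructions back in. */
class ReentrantEngineSketch {
    final List<String> written = new ArrayList<>();
    private final boolean guarded;
    private boolean inOperator = false;

    ReentrantEngineSketch(boolean guarded) {
        this.guarded = guarded;
    }

    /** Stands in for the operator processors: TD is executed as TL plus Td. */
    private void execute(String operator) {
        if (operator.equals("TD")) {
            processOperator("TL");
            processOperator("Td");
        }
    }

    void processOperator(String operator) {
        if (guarded && inOperator) {
            // Nested call: update state only, write nothing
            execute(operator);
            return;
        }
        inOperator = true;
        try {
            execute(operator);
            written.add(operator);   // plays the role of write(...)
        } finally {
            inOperator = false;      // reset even if execute throws
        }
    }

    static List<String> run(boolean guarded, String... operators) {
        ReentrantEngineSketch engine = new ReentrantEngineSketch(guarded);
        for (String operator : operators) {
            engine.processOperator(operator);
        }
        return engine.written;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + run(false, "TD", "Tj"));
        System.out.println("guarded:   " + run(true, "TD", "Tj"));
    }
}
```

Without the guard the output stream contains TL and Td in addition to TD, which is the doubling described above; with the guard only the original TD and Tj are written.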
