PdfBox - How to load color from text

Question

I've seen this question on many different forums, but I have yet to see it answered properly. A few of the existing answers may work for some people, but they are ridiculously overcomplicated. I worked out a solution myself, so check the answer if you are interested.

Recommended answer

The answer: you extract the color of each character via the processTextPosition() method of the PDFTextStripper class.

For the color to be extracted, you need a subclass of PDFTextStripper whose constructor registers additional operators, because the default PDFTextStripper does not process the color operators. See the Text Extraction section of https://pdfbox.apache.org/2.0/migration.html for more information. That page lists the operators to add in the subclass constructor:

addOperator(new SetStrokingColorSpace());
addOperator(new SetNonStrokingColorSpace());
addOperator(new SetStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceCMYKColor());
addOperator(new SetNonStrokingDeviceRGBColor());
addOperator(new SetStrokingDeviceRGBColor());
addOperator(new SetNonStrokingDeviceGrayColor());
addOperator(new SetStrokingDeviceGrayColor());
addOperator(new SetStrokingColor());
addOperator(new SetStrokingColorN());
addOperator(new SetNonStrokingColor());
addOperator(new SetNonStrokingColorN());

We can then add a boolean to our new subclass that is set to true every time a new line is started while the text is being processed:

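import java.io.IOException;

// Color operator classes as named in PDFBox 2.0.x
import org.apache.pdfbox.contentstream.operator.color.*;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextStripperSuper extends PDFTextStripper {
    // True until the first writeString() call after a page or line start
    boolean newLine = true;

    public PDFTextStripperSuper() throws IOException {
        // Register the color operators so the graphics state tracks text color
        addOperator(new SetStrokingColorSpace());
        addOperator(new SetNonStrokingColorSpace());
        addOperator(new SetStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetStrokingDeviceGrayColor());
        addOperator(new SetStrokingColor());
        addOperator(new SetStrokingColorN());
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
    }

    @Override
    protected void startPage(PDPage page) throws IOException {
        // A new page always begins a new line
        newLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException {
        // The next text written belongs to a new line
        newLine = true;
        super.writeLineSeparator();
    }
}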

So now we have a text processor that is ready to extract each line of text as well as the character colors. To implement this, all we have to do is override the writeString() method to get each line of text, and override the processTextPosition() method to get the color of each character:

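import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.text.TextPosition;

public class DocAnalyzer {
    public DocAnalyzer(PDDocument doc) throws IOException {
        ArrayList<String> lines = new ArrayList<>();
        ArrayList<PDColor> charColors = new ArrayList<>();
        PDFTextStripperSuper tp = new PDFTextStripperSuper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions)
                    throws IOException {
                // Only the first writeString() call after a line start is recorded
                if (newLine) {
                    lines.add(text);
                    newLine = false;
                }
                super.writeString(text, textPositions);
            }

            @Override
            protected void processTextPosition(TextPosition text) {
                super.processTextPosition(text);
                // The non-stroking (fill) color in effect when this character is processed
                charColors.add(getGraphicsState().getNonStrokingColor());
            }
        };

        tp.getText(doc); // processes the text and adds to our lists
    }
}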

There you have it! All the colors of the text should now be in your charColors list. That's all the help I'm giving you ;)!
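
Once you have one color per character, a natural next step is to collapse them into runs of same-colored text. Below is a minimal sketch of that idea in plain Java. It is not part of the original answer: the ColorRuns class and its Run type are hypothetical names, and it assumes you have already paired each character with a packed 0xRRGGBB integer (for example one obtained from PDColor.toRGB()), in the same order as the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups characters into runs that share one color. */
final class ColorRuns {

    /** A stretch of consecutive characters drawn in the same packed RGB color. */
    static final class Run {
        final String text;
        final int rgb;

        Run(String text, int rgb) {
            this.text = text;
            this.rgb = rgb;
        }

        /** Formats the packed 0xRRGGBB value as "#RRGGBB". */
        String hex() {
            return String.format("#%06X", rgb & 0xFFFFFF);
        }
    }

    /**
     * Splits text into runs of equal color. rgbPerChar is assumed to hold
     * exactly one packed RGB value per character of text, in the same order.
     */
    static List<Run> group(String text, List<Integer> rgbPerChar) {
        if (text.length() != rgbPerChar.size()) {
            throw new IllegalArgumentException("need exactly one color per character");
        }
        List<Run> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            // Close the current run at the end of the text or when the color changes
            if (atEnd || !rgbPerChar.get(i).equals(rgbPerChar.get(start))) {
                runs.add(new Run(text.substring(start, i), rgbPerChar.get(start)));
                start = i;
            }
        }
        return runs;
    }
}
```

Note that the lists built in DocAnalyzer are not aligned this way out of the box: charColors is filled as characters are processed, while lines is filled when text is written, so you would have to do the pairing yourself before calling something like group().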
