使用PDFBox获取PDF TextObjects [英] Getting PDF TextObjects with PDFBox

查看:439
本文介绍了使用PDFBox获取PDF TextObjects的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个PDF,我使用PDFBox从中提取了一个页面:

I have a PDF from which I extracted a page using PDFBox:

(...)
File input = new File("C:\\temp\\sample.pdf");
document = PDDocument.load(input);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) allPages.get(2);
PDStream contents = page.getContents();
if (contents != null) {
System.out.println(contents.getInputStreamAsString());
(...)

这给出了以下结果,看起来像你' d期望,基于 PDF spec

This gives the following result, which looks like something you'd expect, based on the PDF spec.

q
/GS0 gs
/Fm0 Do
Q
/Span <</Lang (en-US)/MCID 88 >>BDC 
BT
/CS0 cs 0 0 0  scn
/GS1 gs
/T1_0 1 Tf
8.5 0 0 8.5 70.8661 576 Tm
(This page has been intentionally left blank.)Tj
ET
EMC 
1 1 1  scn
/GS0 gs
22.677 761.102 28.346 32.599 re
f
/Span <</Lang (en-US)/MCID 89 >>BDC 
BT
0.531 0.53 0.528  scn
/T1_1 1 Tf
9 0 0 9 45.7136 761.1024 Tm
(2)Tj
ET
EMC 
q
0 g
/Fm1 Do
Q

我在寻找什么for是基本上将页面上的PDF TextObjects(如PDF规范的5.3中所述)提取为java Objects BT和ET之间的部分(本页的两个'en)。
它们至少应包含Tj前面的括号之间的所有字符串,以及基于Tm(或Td运算符等)的x和ycoördinate。其他属性是奖励,但不是必需的。

What I'm looking for is to extract the PDF TextObjects (as described in par 5.3 of the PDF spec) on the page as java Objects, so basically the pieces between BT an ET (two of 'en on this page). They should at least contain everything between the brackets preceding 'Tj' as a String, and an x and y coördinate based on the 'Tm' (or a 'Td' operator, etc.). Other attributes would be a bonus, but are not required.

PDFTextStripper似乎给我每个角色的属性作为TextPosition(为我的目的太多噪音),或者所有文本作为一个长字符串。

The PDFTextStripper seems to give me either each character with attributes as a TextPosition (too much noise for my purpose), or all the Text as one long String.

PDFBox是否有一个功能可以解析一个页面并提供我错过的像这样的TextObjects?或者,如果我要扩展PDFBox以获得我需要的东西,我应该从哪里开始?欢迎任何帮助。

Does PDFBox have a feature that parses a Page and provides TextObjects like this that I missed? Or else, if I am to extend PDFBox to get what I need, where should I start? Any help is welcome.

编辑:发现另一个问题这里,它提供了我如何构建我需要的灵感。如果我成功了,我会回来看看。不过还是期待你的任何帮助。

EDIT: Found another question here, that gives inspiration on how I might build what I need. If I succeed, I'll check back. Still looking forward to any help you may have, though.

谢谢,

Phil

推荐答案

基于链接的问题和 mkl的提示昨天(谢谢!),我决定构建一些东西来解析令牌。
要考虑的是在PDF文本对象中,属性在操作符之前,所以我收集集合中的所有属性,直到遇到操作符。
然后,当我知道属性属于哪个运算符时,我将它们移动到适当的位置。
这就是我提出的:

Based on the linked question and the hint by mkl yesterday (thanks!), I've decided to build something to parse the tokens. Something to consider is that within a PDF Text Object, the attributes precede the operator, so I collect all attributes in a collection until I encounter the operator. Then, when I know what operator the attributes belong to, I move them to their proper locations. This is what I've come up with:

import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public class TextExtractor {
    public static void main(String[] args) { 
        try {
            File input = new File("C:\\some\\file.pdf");
            PDDocument document = PDDocument.load(input);
            List allPages = document.getDocumentCatalog().getAllPages();
            // just parsing page 2 here, as it's only a sample
            PDPage page = (PDPage) allPages.get(2);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();  
            List tokens = parser.getTokens();  
            boolean parsingTextObject = false; //boolean to check whether the token being parsed is part of a TextObject
            PDFTextObject textobj = new PDFTextObject();
            for (int i = 0; i < tokens.size(); i++)  
            {  
                Object next = tokens.get(i); 
                if (next instanceof PDFOperator)  {
                    PDFOperator op = (PDFOperator) next;  
                    switch(op.getOperation()){
                        case "BT":
                            //BT: Begin Text. 
                            parsingTextObject = true;
                            textobj = new PDFTextObject();
                            break;
                        case "ET":
                            parsingTextObject = false;
                            System.out.println("Text: " + textobj.getText() + "@" + textobj.getX() + "," + textobj.getY());
                            break;
                        case "Tj":
                            textobj.setText();
                            break;
                        case "Tm":
                            textobj.setMatrix();
                            break;
                        default:
                            //System.out.println("unsupported operation " + op.getOperation());
                    }
                    textobj.clearAllAttributes();
                }
                else if (parsingTextObject)  {
                    textobj.addAttribute(next);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } 
    }
}

结合使用:

import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSString;

class PDFTextObject{
    private List attributes = new ArrayList<Object>();
    private String text = "";
    private float x = -1;
    private float y = -1;

    public void clearAllAttributes(){
        attributes = new ArrayList<Object>();
    }

    public void addAttribute(Object anAttribute){
        attributes.add(anAttribute);
    }

    public void setText(){
        //Move the contents of the attributes to the text attribute.
        for (int i = 0; i < attributes.size(); i++){
            if (attributes.get(i) instanceof COSString){
                COSString aString = (COSString) attributes.get(i);
                text = text + aString.getString();
            }
            else {
                System.out.println("Whoops! Wrong type of property...");
            }
        }
    }

    public String getText(){
        return text;
    }

    public void setMatrix(){
        //Move the contents of the attributes to the x and y attributes.
        //A Matrix has 6 attributes, the last two of which are x and y
        for (int i = 4; i < attributes.size(); i++){
            float curval = -1;
            if (attributes.get(i) instanceof COSInteger){
                COSInteger aCOSInteger = (COSInteger) attributes.get(i); 
                curval = aCOSInteger.floatValue();

            }
            if (attributes.get(i) instanceof COSFloat){
                COSFloat aCOSFloat = (COSFloat) attributes.get(i);
                curval = aCOSFloat.floatValue();
            }
            switch(i) {
                case 4:
                    x = curval;
                    break;
                case 5:
                    y = curval;
                    break;
            }
        }
    }

    public float getX(){
        return x;
    }

    public float getY(){
        return y;
    }
}

它给出了输出:

Text: This page has been intentionally left blank.@70.8661,576.0
Text: 2@45.7136,761.1024

虽然它可以解决问题,但我确信我已经破坏了一些惯例并且并不总是编写最优雅的代码。欢迎改进和替代解决方案。

While it does the trick, I'm sure I've broken some conventions and haven't always written the most elegant code. Improvements and alternate solutions are welcome.

这篇关于使用PDFBox获取PDF TextObjects的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆