Text associated with a PDF paragraph in the document content object with PDFBox


Question

I'm trying to get the text associated with a paragraph by navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text it contains (see code below):

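// Imports assume PDFBox 2.x, which matches the API calls used below
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

public class ReadPdf {
    public static void main(String[] args) throws IOException {
        // MyBufferedWriter is a custom writer class that is not shown here
        MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File("C:/Users/wip.txt")));
        RandomAccessFile raf = new RandomAccessFile(new File("C:/Users/mypdf.pdf"), "r");
        PDFParser parser = new PDFParser(raf);
        parser.parse();

        // Dump the low-level COS objects for inspection
        COSDocument cosDoc = parser.getDocument();
        out.write(cosDoc.getXrefTable().toString());
        out.write(cosDoc.getObjects().toString());
        PDDocument document = parser.getPDDocument();

        PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();

        for (Object kid : treeRoot.getKids()) {
            for (Object kid2 : ((PDStructureElement) kid).getKids()) {
                PDStructureElement kid2c = (PDStructureElement) kid2;
                // Strings must be compared with equals(), not ==
                if ("P".equals(kid2c.getStandardStructureType())) {
                    for (Object kid3 : kid2c.getKids()) {
                        if (kid3 instanceof PDStructureElement) {
                            // Nested structure element: nothing is done with it yet
                            PDStructureElement kid3c = (PDStructureElement) kid3;
                        } else {
                            for (Entry<COSName, COSBase> entry : kid2c.getCOSObject().entrySet()) {
                                // Print all the keys in the paragraph COSDictionary
                                System.out.println(entry.getKey().toString());
                                System.out.println(entry.getValue().toString());
                            }
                        }
                    }
                }
            }
        }
    }
}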

When I print the contents, I get the following keys:

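  • /P: reference to the parent
  • /A: format of the paragraph
  • /K: position of the paragraph within the section
  • /C: name of the paragraph (!= text)
  • /Pg: reference to the page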

Example output:

COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}

None of these points to the actual text inside the paragraph. I know the relation can be recovered, because Acrobat resolves it when I open the document.

Recommended answer

I found a way to do this by parsing the content stream of a page. According to section 10.6.3 of the PDF specification, each piece of text in the content stream sits inside a marked-content sequence that carries a structure tag and an MCID (for example /P <</MCID 0>> BDC ... EMC). That MCID is linked to the /K entry of the tag (a PDStructureElement in PDFBox), which can be read from its COSObject.

1) To get the text and the MCID:

// PDFBox 2.x; pdPage is the page to inspect, here (as an example) the first page
PDPage pdPage = document.getPage(0);
StringBuilder tokenString = new StringBuilder();
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
    try {
        PDFStreamParser parser2 = new PDFStreamParser(inputStream.next());
        parser2.parse();
        List<Object> tokens = parser2.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            tokenString.append(tokens.get(j).toString());
        }
        // Here comes the parsing of the string. Chapter 5 of the specification explains
        // the operators Tj (shows the actual text), Tm, BT, ET, BDC and EMC; MCID is a BDC property.
    } catch (IOException e) {
        e.printStackTrace();
    }
}
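The string-parsing step that the comment leaves open could look roughly like the sketch below. It is a simplified illustration, not PDFBox code: the tokens are modeled as plain strings (real PDFBox tokens are COSName, COSDictionary, COSString and Operator objects), and the class name McidTextGrouper and its method are hypothetical. It handles only the Tj operator, and it ignores nested marked-content sequences, TJ arrays, string escapes and font encodings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups shown text by the MCID of the enclosing BDC ... EMC sequence.
final class McidTextGrouper {
    private static final Pattern MCID = Pattern.compile("/MCID\\s+(\\d+)");

    static Map<Integer, String> groupTextByMcid(List<String> tokens) {
        Map<Integer, StringBuilder> builders = new LinkedHashMap<>();
        Integer currentMcid = null; // null while outside any marked-content sequence
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("BDC") && i >= 1) {
                // The operand right before BDC is the property list holding the MCID
                Matcher m = MCID.matcher(tokens.get(i - 1));
                currentMcid = m.find() ? Integer.valueOf(m.group(1)) : null;
            } else if (token.equals("EMC")) {
                currentMcid = null; // simplification: no nesting support
            } else if (token.equals("Tj") && i >= 1 && currentMcid != null) {
                String operand = tokens.get(i - 1);
                if (operand.length() >= 2) {
                    // Strip the surrounding parentheses of the literal string
                    String text = operand.substring(1, operand.length() - 1);
                    builders.computeIfAbsent(currentMcid, k -> new StringBuilder()).append(text);
                }
            }
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        builders.forEach((mcid, sb) -> result.put(mcid, sb.toString()));
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "/P", "<</MCID 0>>", "BDC", "BT", "(Hello )", "Tj", "(world)", "Tj", "ET", "EMC",
            "/P", "<</MCID 1>>", "BDC", "BT", "(Second paragraph)", "Tj", "ET", "EMC");
        System.out.println(groupTextByMcid(tokens)); // prints {0=Hello world, 1=Second paragraph}
    }
}
```

With real PDFBox tokens the same loop would check instanceof on each token instead of comparing strings, but the bookkeeping (remember the MCID at BDC, collect text until EMC) stays the same.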

2) Then get the tags and the attribute that matches the MCID:

// pDStructureElement is a tag (PDStructureElement) taken from the structure tree
// getInt returns -1 when /K is not a plain integer
int mcid = pDStructureElement.getCOSObject().getInt(COSName.K);
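Reading /K as a single integer covers the simple case shown in the question's output. In general, /K may also be an array or a dictionary, so a paragraph can own several MCIDs as well as nested tags. The sketch below shows one way the matching could be done once a map from MCID to text is available. It uses a hypothetical Tag class as a stand-in for PDStructureElement so that it runs without PDFBox, and it assumes all MCIDs come from the same page, since MCIDs are only unique within one content stream.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not PDFBox API: resolves the text of a tag from its MCID kids.
final class ParagraphTextResolver {
    /** Stand-in for a structure element: kids are either Integer MCIDs or nested Tags. */
    static final class Tag {
        final String type;
        final List<Object> kids;

        Tag(String type, Object... kids) {
            this.type = type;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Concatenates the text of all MCIDs below the tag, in document order. */
    static String textOf(Tag tag, Map<Integer, String> textByMcid) {
        StringBuilder sb = new StringBuilder();
        for (Object kid : tag.kids) {
            if (kid instanceof Integer) {
                // Direct marked-content kid: look up its text, skip unknown MCIDs
                sb.append(textByMcid.getOrDefault((Integer) kid, ""));
            } else if (kid instanceof Tag) {
                // Nested tag (for example a Span inside a P): recurse
                sb.append(textOf((Tag) kid, textByMcid));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> textByMcid = new HashMap<>();
        textByMcid.put(0, "Heading");
        textByMcid.put(1, "Some ");
        textByMcid.put(2, "bold");
        textByMcid.put(3, " text.");

        Tag paragraph = new Tag("P", 1, new Tag("Span", 2), 3);
        System.out.println(textOf(paragraph, textByMcid)); // prints Some bold text.
    }
}
```

In PDFBox itself, the list returned by getKids() on a structure element plays the role of Tag.kids here, so the same recursive walk applies with instanceof checks against the PDFBox types.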

That should do it. In documents without tags (document.getDocumentCatalog().getStructureTreeRoot() returns null or has no children) this match cannot be performed, but the text can still be read using step 1.
