Advanced PDF parser for Java

Question

I want to extract different content from a PDF file in Java:


  • the complete visible text

  • images

  • links

Is it also possible to get the following?


  • document meta tags like title, description or author

  • only headlines

  • input elements, if the document contains a form

I do not need to manipulate or render PDF files. Which library would be the best fit for that kind of purpose?

Update

OK, I tried PDFBox:

Document luceneDocument = LucenePDFDocument.getDocument(new File(path));
Field contents = luceneDocument.getField("contents");
System.out.println(contents.stringValue());

But the output is null. The field "summary" is OK though.

The next snippet works fine.

PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
System.out.println(text);
doc.close();

But then, I have no clue how to extract the images, links, etc.

Update 2

I found an example of how to extract the images, but I still have no answer on how to extract:


  • links
  • document meta tags like title, description or author
  • only headlines
  • input elements if the document contains a form

Recommended answer

iText is my PDF tool of choice these days.



  • the complete visible text

"Visible" is a tough one. You can parse out all the parsable text with the classes in the com.itextpdf.text.pdf.parser package, but those classes don't know about clipping. You can constrain the parser to the page size easily enough.

// all text on the page, regardless of position
PdfTextExtractor.getTextFromPage(reader, pageNum);

You'd actually need the overload that takes a TextExtractionStrategy, and pass it a filtered strategy. It gets interesting fairly quickly, but I think you can get everything you want here "out of the box".



  • images

Yep, via the same package classes. Image listeners aren't as well supported as text listeners, but they do exist.



  • links

Yes. Links are "annotations" on the various PDF pages. Finding them is a simple matter of looping through each page's "annotations array" and picking out the link annotations.

import java.util.ArrayList;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;

// myReader is an already opened PdfReader; page numbers start at 1
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
ArrayList<String> dests = new ArrayList<String>();
if (annots != null) {
  for (int i = 0; i < annots.size(); ++i) {
    PdfDictionary annotDict = annots.getAsDict(i);
    PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
    // keep only link annotations
    if (subType != null && PdfName.LINK.equals(subType)) {
      PdfDictionary action = annotDict.getAsDict(PdfName.A);
      // a URI action points outside the document
      if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))) {
        dests.add(action.getAsString(PdfName.URI).toString());
      } // else { it's an internal link, meh }
    }
  }
}

You can find the details in the PDF specification.



  • input elements

Definitely. For either XFA (LiveCycle Designer) or the older-tech "AcroForm" forms, iText can find all the fields and their values.

AcroFields fields = myReader.getAcroFields();

Set<String> fieldNames = fields.getFields().keySet();
for (String fldName : fieldNames) {
  System.out.println( fldName + ": " + fields.getField( fldName ) );
}

Multi-select lists wouldn't be handled all that well. You'll get a blank space after the colon for empty text fields and for buttons. None too informative, but that'll get you started.



  • document meta tags like title, description or author

Yes, and it is very simple.

Map<String, String> info = myPdfReader.getInfo();
System.out.println( info );

In addition to the basic author/title/etc., there's a fairly involved XML schema (XMP) you can access via reader.getMetadata().
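The bytes returned by reader.getMetadata() are an XML packet, so they can be read with the JDK's own XML parser. Below is a minimal sketch, assuming the packet is well-formed XML and stores the title in the usual Dublin Core layout (dc:title holding an rdf:Alt list). The class name XmpTitleSketch and the method firstTitle are made up for illustration and are not part of iText.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not an iText class: reads the first dc:title from XMP bytes.
class XmpTitleSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the first dc:title entry, or null if the packet has none. */
    static String firstTitle(byte[] xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XMP elements are identified by namespace
            // PDFs may come from untrusted sources, so refuse DOCTYPE declarations
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            // dc:title is a language-alternative list; take its first rdf:li item
            NodeList items = ((Element) titles.item(0)).getElementsByTagNameNS(RDF_NS, "li");
            return items.getLength() == 0 ? null : items.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample packet standing in for reader.getMetadata()
        String sample =
            "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
          + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
          + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
          + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Sample title</rdf:li></rdf:Alt></dc:title>"
          + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(firstTitle(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With a real document you would pass the result of reader.getMetadata() instead of the sample string, after checking it for null, since not every PDF carries an XMP packet.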



  • only headlines

A TextRenderFilter can ignore text based on whatever criteria you wish. Font size sounds about right based on your comment.
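The font-size criterion itself can be sketched without any PDF library. The example below assumes you have already collected text runs together with their font sizes; how you obtain the sizes from the parser is left open here. It treats the size covering the most characters as body text and keeps anything clearly larger. The Chunk class, the method names and the 1.2 ratio are all hypothetical choices for illustration, not iText API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a "headline = larger than body text" heuristic.
class HeadlineSketch {
    /** One run of text and its font size in points (illustrative holder). */
    static final class Chunk {
        final String text;
        final float fontSize;
        Chunk(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Returns the font size that covers the most characters, taken as the body size. */
    static float bodySize(List<Chunk> chunks) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Chunk c : chunks) {
            charsPerSize.merge(c.fontSize, c.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Keeps the chunks whose size is at least ratio times the body size. */
    static List<String> headlines(List<Chunk> chunks, float ratio) {
        float threshold = bodySize(chunks) * ratio;
        List<String> result = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.fontSize >= threshold) {
                result.add(c.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> page = Arrays.asList(
            new Chunk("Introduction", 18f),
            new Chunk("PDF files store text as positioned glyph runs.", 10f),
            new Chunk("Details", 14f),
            new Chunk("More body text follows here.", 10f));
        System.out.println(headlines(page, 1.2f));
    }
}
```

A heuristic like this misclassifies documents whose headings differ from the body only by weight or colour, so treat the ratio as a starting point to tune per document set.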
