Advanced PDF parser for Java


Question

I want to extract different content from a PDF file in Java:

  • The complete visible text
  • Images
  • Links

Is it also possible to get the following?

  • Document meta tags such as title, description, or author
  • Only headlines
  • Input elements (if the document contains a form)

I do not need to manipulate or render PDF files. Which library would be the best fit for this purpose?

Update

OK, I tried PDFBox:

Document luceneDocument = LucenePDFDocument.getDocument(new File(path));
Field contents = luceneDocument.getField("contents");
System.out.println(contents.stringValue());

But the output is null. The "summary" field is OK, though.

The next snippet works fine.

PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
System.out.println(text);
doc.close();

But then, I have no clue how to extract the images, links, etc.

Update 2

I found an example of how to extract the images, but I still have no answer on how to extract:

  • Links
  • Document meta tags such as title, description, or author
  • Only headlines
  • Input elements (if the document contains a form)

Recommended answer

iText is my PDF tool of choice these days.

  • The complete visible text

"Visible" is a tough one. You can parse out all the parsable text with the classes in the com.itextpdf.text.pdf.parser package, but those classes don't know about clipping. You can constrain the parser to the page size easily enough.

// all text on the page, regardless of position
PdfTextExtractor.getTextFromPage(reader, pageNum);

You'd actually need the overload that takes a TextExtractionStrategy, using the filtered strategy. It gets interesting fairly quickly, but I think you can get everything you want here "out of the box".
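To make the "constrain to the page size" idea concrete, here is a minimal, library-independent sketch. `PlacedText` and `PageRegionFilter` are hypothetical stand-ins, not iText classes, and the sketch assumes the page box starts at (0, 0), which is not true of every PDF. In iText the same test would live inside a filter handed to the filtered strategy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned run of text; not an iText class. */
final class PlacedText {
    final String text;
    final float x; // start of the baseline, in page units
    final float y;

    PlacedText(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class PageRegionFilter {
    /**
     * Keeps only runs whose start point lies inside a page of the given size.
     * Assumption: the page box has its lower-left corner at (0, 0).
     */
    static List<String> inside(List<PlacedText> runs, float width, float height) {
        List<String> kept = new ArrayList<String>();
        for (PlacedText run : runs) {
            boolean onPage = run.x >= 0 && run.x <= width
                    && run.y >= 0 && run.y <= height;
            if (onPage) {
                kept.add(run.text);
            }
        }
        return kept;
    }
}
```

This only checks where a run starts; text that begins on the page and runs off the edge, or text hidden by a clipping path, would still be reported.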

  • Images

Yep, via the same package's classes. Image listeners aren't as well supported as text listeners, but they do exist.

  • Links

Yes. Links are "annotations" on individual PDF pages. Finding them is a simple matter of looping through each page's "annotations array" and picking out the link annotations.

PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
ArrayList<String> dests = new ArrayList<String>();
if (annots != null) {
  for (int i = 0; i < annots.size(); ++i) {
    PdfDictionary annotDict = annots.getAsDict(i);
    PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
    if (subType != null && PdfName.LINK.equals(subType)) {
      PdfDictionary action = annotDict.getAsDict(PdfName.A);
      if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))) {
        dests.add(action.getAsString(PdfName.URI).toString());
      } // else { it's an internal link, meh }
    }
  }
}

The PDF specification documents link annotations and their actions in detail.

  • Input elements

Definitely. For either XFA (LiveCycle Designer) forms or the older-tech "AcroForm" forms, iText can find all the fields and their values.

AcroFields fields = myReader.getAcroFields();

Set<String> fieldNames = fields.getFields().keySet();
for (String fldName : fieldNames) {
  System.out.println( fldName + ": " + fields.getField( fldName ) );
}

Multi-select lists wouldn't be handled all that well. You'll get a blank space after the colon for empty text fields and for buttons. None too informative, but that'll get you started.
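One way to make the blank-after-the-colon output easier to read is to collect the name/value pairs into a map first and label the empty ones. The sketch below uses only the JDK. `FieldReport` and the "(empty)" label are hypothetical choices, and it assumes the values have already been read into a `Map<String, String>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FieldReport {
    /**
     * Formats each field as "name: value", sorted by field name.
     * Null or whitespace-only values are shown as "(empty)".
     */
    static List<String> describe(Map<String, String> values) {
        List<String> lines = new ArrayList<String>();
        // TreeMap gives a stable, alphabetical order for the report.
        for (Map.Entry<String, String> entry : new TreeMap<String, String>(values).entrySet()) {
            String value = entry.getValue();
            boolean blank = value == null || value.trim().isEmpty();
            lines.add(entry.getKey() + ": " + (blank ? "(empty)" : value));
        }
        return lines;
    }
}
```

This does not distinguish an empty text field from a button; telling those apart would need the field type, which the map does not carry.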

  • Document meta tags such as title, description, or author

Fairly trivial. Yes.

Map<String, String> info = myPdfReader.getInfo();
System.out.println( info );

In addition to the basic author/title/etc., there's a fairly involved XML schema you can access via reader.getMetadata().
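If you only want a few of those entries rather than the whole map, a small helper can pick them out. This is a JDK-only sketch: `InfoPicker` is a hypothetical name. "Title", "Author", and "Subject" are standard Info dictionary entries in the PDF specification, but that the map returned by getInfo() uses exactly those strings as keys is an assumption here, as is treating "Subject" as the closest match for "description".

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InfoPicker {
    /**
     * Copies the requested keys from the info map, in the order given,
     * skipping entries that are missing or blank.
     */
    static Map<String, String> pick(Map<String, String> info, String... keys) {
        Map<String, String> picked = new LinkedHashMap<String, String>();
        for (String key : keys) {
            String value = info.get(key);
            if (value != null && !value.trim().isEmpty()) {
                picked.put(key, value);
            }
        }
        return picked;
    }
}
```

A call might look like `InfoPicker.pick(info, "Title", "Subject", "Author")`.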

  • Only headlines

A TextRenderFilter can ignore text based on whatever criteria you wish. Font size sounds about right based on your comment.
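The font-size criterion can be sketched without any PDF library. `TextChunk` and `HeadlineFilter` below are hypothetical stand-ins, not iText classes: `TextChunk` represents whatever per-run information the parser hands to a filter, and the threshold is something you would have to tune per document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one run of text reported by a PDF parser. */
final class TextChunk {
    final String text;
    final float fontSize; // in points

    TextChunk(String text, float fontSize) {
        this.text = text;
        this.fontSize = fontSize;
    }
}

final class HeadlineFilter {
    /** Returns the text of every chunk whose font size is at least minSize. */
    static List<String> headlines(List<TextChunk> chunks, float minSize) {
        List<String> result = new ArrayList<String>();
        for (TextChunk chunk : chunks) {
            if (chunk.fontSize >= minSize) {
                result.add(chunk.text);
            }
        }
        return result;
    }
}
```

Font size alone is a heuristic: large pull quotes or cover-page text would pass the same test, so a real filter might also look at the font name or position.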
