What is the easiest way to extract data from a PDF?

Question

I need to extract data from some PDF documents (using Java). I need to know what would be the easiest way to do it.

I tried iText, but it is fairly complicated for my needs. Besides, I believe it is not available for free for commercial projects, so it is not an option. I also gave PDFBox a try and ran into various NoClassDefFoundError errors.

I googled and came across several other options, such as PDF Clown and jPod, but I do not have time to experiment with all of these libraries. I am relying on the community's experience with reading PDFs in Java.

Note that I do not need to create or manipulate PDF documents. I just need to extract textual data from PDF documents with a moderate level of layout complexity.

Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.

Recommended answer

I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.

One benefit of Tika (besides being free) is that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX content handler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it lets you create a new parser or subclass an existing one to customize the behavior.
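To make the SAX content handler idea concrete, here is a minimal sketch that uses only the JDK, not Tika itself. It assumes the parser reports a document as XHTML-style SAX events, which is how Tika's parsers deliver their output, and shows a handler that collects the text. The names SaxTextDemo, TextCollector, and extractText are hypothetical, and the JDK's own SAX parser stands in for Tika's PDF parser here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows how a SAX handler receives text events.
final class SaxTextDemo {

    /** Collects character data and adds a newline after each closed <p>. */
    private static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Called by the parser for each run of text it encounters.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // The default factory is not namespace-aware, so match on qName.
            if ("p".equals(qName)) {
                text.append('\n');
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Feeds an XHTML string through the JDK SAX parser and returns its text. */
    static String extractText(String xhtml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        // Stand-in for the events a document parser would emit.
        String xhtml = "<html><body><p>Invoice 42</p><p>Total: 10.00</p></body></html>";
        System.out.print(extractText(xhtml));
    }
}
```

Running main prints the two paragraph texts on separate lines. With Tika, a handler such as BodyContentHandler plays the role of TextCollector, and the events come from the PDF parser rather than from an XML string.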

The code is simple. To customize how a document is parsed, you create a class that implements the Parser interface and define a parse() method:

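// Skeleton of a custom parser's parse() method. HELLO_MIME_TYPE is a
// String constant that the enclosing parser class is expected to define.
public void parse(
   InputStream stream, ContentHandler handler,
   Metadata metadata, ParseContext context)
   throws IOException, SAXException, TikaException {

   // Record the content type and a sample metadata property.
   metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
   metadata.set("Hello", "World");

   // Emit an empty XHTML document as SAX events to the caller's handler.
   XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
   xhtml.startDocument();
   xhtml.endDocument();
}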

Then, to run the parser, you could do something like this:

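import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// resourceLocation is the path to the PDF; the stream is closed automatically.
try (InputStream input = new FileInputStream(resourceLocation)) {
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    // The four-argument parse() is the one declared by the Parser interface.
    parser.parse(input, textHandler, metadata, new ParseContext());
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
    System.out.println("content: " + textHandler.toString());
}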
