PDF to text tool or Java library?
Question
I need to convert a PDF to plain text (it's the "statement of votes" from our county registrar). The files are big (2000 pages or so) and mostly contain tables. Once I have the text, I'm going to use a program I'm writing to parse it and put the data into a database.
I've tried the 'Save as text' function in Adobe Reader, but it is not as precise as I'd like, especially in delimiting the table data into CSV.
So, any recommendations for tools or Java libraries that would do the trick?
Recommended answer
Well, there is iText. I have only limited experience with it, but it seems it can do what you want.
Apache PDFBox surely can do it. Its site mentions "PDF to text extraction" as its top feature. There's an ExtractText command line tool specifically for this (source code), based on its PDFTextStripper class. And there's a PDFBox Text Extraction Guide, too!
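Whichever tool produces the text, the table rows still have to be delimited before they can be loaded into a database. Here is a minimal sketch of that step using only the Java standard library. It assumes the extracted text keeps columns apart with runs of two or more spaces, which may not hold for every PDF. The class name TableLineToCsv, its methods, and the sample row are made up for illustration and are not part of PDFBox or iText.

```java
import java.util.ArrayList;
import java.util.List;

public class TableLineToCsv {

    // Split one line of extracted text into cells.
    // Assumption: columns are separated by two or more spaces,
    // while single spaces belong to the cell's own text.
    public static List<String> splitCells(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split(" {2,}")) {
            cells.add(cell);
        }
        return cells;
    }

    // Quote a cell if it contains a comma, a double quote, or a newline,
    // doubling any embedded double quotes (the usual CSV convention).
    public static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Convert one line of extracted text into one CSV record.
    public static String toCsv(String line) {
        List<String> quoted = new ArrayList<>();
        for (String cell : splitCells(line)) {
            quoted.add(quote(cell));
        }
        return String.join(",", quoted);
    }

    public static void main(String[] args) {
        // Hypothetical row, not taken from any real statement of votes.
        String line = "PRECINCT 0101   1,204    873   72.5%";
        System.out.println(toCsv(line));
    }
}
```

Numbers with thousands separators, such as 1,204 in the sample row, come out quoted so the comma is not mistaken for a delimiter. If the extracted text does not preserve column spacing, splitting on whitespace runs will not be reliable and a position-based approach would be needed instead.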