PDF to text tool or Java library?


Question


I need to convert a PDF to normal text (it's the "statement of votes" from our county registrar). The files are big (2000 pages or so) and mostly contain tables. Once I get it into text, then I'm going to use a program I'm writing to parse it and put the data into a database. I've tried the 'Save as text' function in Adobe Reader, but it is not as precise as I'd like it, especially in delimiting the table data into CSV. So, any recommendations for tools or Java libraries that would do the trick?

Recommended answer


Well, there is iText. I have only limited experience with it, but it seems it can do what you want.


Apache PDFBox surely can do it. Its site mentions "PDF to text extraction" as its top feature. There's an ExtractText command line tool specifically for this (source code), based on its PDFTextStripper class. And there's a PDFBox Text Extraction Guide, too!
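For illustration, here is a minimal sketch of calling PDFTextStripper directly instead of going through the ExtractText tool. It assumes PDFBox 2.x is on the classpath (in PDFBox 3.x, documents are opened with `org.apache.pdfbox.Loader.loadPDF` rather than `PDDocument.load`). The class name `PdfToText` and its `extractText` helper are made up for this example. Enabling `setSortByPosition` is a guess at what helps with table-heavy pages, not something the PDFBox guide prescribes for this case.

```java
import java.io.File;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class for this example; not part of PDFBox.
class PdfToText {

    // Streams the text of every page to 'out', so a 2000-page file
    // is never held in memory as one giant String.
    static void extractText(File pdf, Writer out) throws IOException {
        // PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(pdf) instead.
        try (PDDocument document = PDDocument.load(pdf)) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Emit text ordered by position on the page rather than by its
            // order in the content stream (an assumption that this suits tables).
            stripper.setSortByPosition(true);
            stripper.writeText(document, out);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PdfToText <input.pdf> <output.txt>");
            System.exit(1);
        }
        try (Writer out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            extractText(new File(args[0]), out);
        }
    }
}
```

The output is plain text with one line per row of glyphs, so splitting table rows into columns is still up to your own parser. If you need to process the file in chunks, `setStartPage` and `setEndPage` on the stripper restrict extraction to a page range.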
