How to extract text from a PDF?

Question

Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in known regions of the document, so the API will need to give us positional information for each element on the page.

We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but would like to hear other people's experiences and suggestions.

Is there a way to extract text from a PDF programmatically (commercial or free)?

Recommended answer

I was given a 400-page PDF file with a table of data that I had to import; luckily it contained no images. Ghostscript worked for me:

gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
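If the extraction needs to run from a script rather than by hand, the same command can be wrapped in a few lines of Python. This is only a sketch: the function names build_gs_command and extract_text are made up for illustration, and it assumes a Ghostscript executable is on the PATH (gswin64c is the 64-bit Windows console build; on Linux and macOS the executable is usually called gs).

```python
import subprocess


def build_gs_command(pdf_path, txt_path, gs_executable="gswin64c"):
    """Return the argument list for the Ghostscript call shown above.

    gs_executable is an assumption about the local install; pass "gs"
    on Linux or macOS.
    """
    return [gs_executable, "-sDEVICE=txtwrite", "-o", txt_path, pdf_path]


def extract_text(pdf_path, txt_path, gs_executable="gswin64c"):
    """Run Ghostscript and return the extracted text as a string."""
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(pdf_path, txt_path, gs_executable), check=True)
    # errors="replace" avoids a crash if the output is not valid UTF-8.
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Calling extract_text("input.pdf", "output.txt") would then reproduce the command above and hand back the contents of output.txt.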

The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc., and load all 30,000 records. The -dSIMPLE and -dCOMPLEX options made no difference in this case.
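The answer does not show the cleanup app, so the following is only a rough sketch of what such a step might look like. It assumes that each record sits on one line, that columns are separated by runs of two or more spaces, and that repeated page headers can be recognised by a caller-supplied test; parse_records and is_header are hypothetical names, and the sample text is invented. The result is dumped as JSON, since that is one of the formats the question asks for.

```python
import json
import re


def parse_records(text, is_header=lambda line: False):
    """Turn txtwrite-style text into a list of records (lists of fields).

    Assumptions (not from the original answer): one record per line,
    columns separated by two or more whitespace characters, and page
    headers identifiable by the is_header predicate.
    """
    records = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Drop blank lines and repeated per-page headers.
        if not line or is_header(line):
            continue
        records.append(re.split(r"\s{2,}", line))
    return records


# Invented sample standing in for a fragment of output.txt.
sample = "Name    Qty\n\nWidget    3\n   \nGadget    12\nName    Qty\nBolt    7\n"
records = parse_records(sample, is_header=lambda line: line.startswith("Name"))
print(json.dumps(records))
```

Real tables often need a stricter header test or fixed column offsets, so the splitting rule here should be treated as a starting point rather than something that will fit every PDF.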
