What is the best Perl module to extract text from a PDF?
Question
What is the best way to extract text from a PDF?
Recommended answer
The CAM::PDF module is useful for extracting text while keeping some information about where in the document it came from. It installs a sample script, /usr/local/bin/getpdftext.pl, which demonstrates simple extraction. However, CAM::PDF can only read PDFs that are completely valid.
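As a rough illustration of the CAM::PDF approach, here is a minimal sketch that pulls the text out page by page. It assumes CAM::PDF is installed from CPAN; the helper name cam_pdf_pages is made up for this example and is not part of the module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use CAM::PDF;

# Hypothetical helper (not part of CAM::PDF): returns one string per page.
sub cam_pdf_pages {
    my ($pdf_path) = @_;

    # new() returns undef when the file is missing or is not a valid PDF.
    my $pdf = CAM::PDF->new($pdf_path)
        or die "Cannot parse $pdf_path: $CAM::PDF::errstr\n";

    my @pages;
    for my $n (1 .. $pdf->numPages()) {
        # getPageText() returns undef on failure; store an empty string then.
        my $text = $pdf->getPageText($n);
        push @pages, defined $text ? $text : '';
    }
    return @pages;
}

# Only run when a file name is given on the command line.
if (@ARGV) {
    my @pages = cam_pdf_pages($ARGV[0]);
    for my $i (0 .. $#pages) {
        printf "--- page %d ---\n%s\n", $i + 1, $pages[$i];
    }
}
```

Keeping the text split by page preserves at least the page number of each piece of text, which is the simplest form of the positional information mentioned above.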
If you are dealing with ill-formed PDFs, you may need a more lenient parser, such as the pdftotext command-line tool. It dumps foo.pdf to foo.txt, which you can then read into Perl.
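One possible way to drive pdftotext from Perl is sketched below. Instead of reading foo.txt afterwards, it passes "-" as the output file so the text arrives on a pipe. It assumes a pdftotext binary (from Xpdf or Poppler) is on the PATH; the helper names pdftotext_command and pdftotext_extract are invented for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: builds the argument list for pdftotext.
# The trailing "-" tells pdftotext to write to stdout instead of foo.txt.
sub pdftotext_command {
    my ($pdf_path) = @_;
    return ('pdftotext', '-enc', 'UTF-8', $pdf_path, '-');
}

# Hypothetical helper: runs pdftotext and returns the whole text as one string.
sub pdftotext_extract {
    my ($pdf_path) = @_;

    # List-form pipe open avoids the shell, so odd file names are safe.
    open(my $fh, '-|', pdftotext_command($pdf_path))
        or die "Cannot run pdftotext: $!\n";
    binmode($fh, ':encoding(UTF-8)');

    my $text = do { local $/; <$fh> };   # slurp everything at once
    $text = '' unless defined $text;

    # close() returns false if pdftotext exited with a non-zero status.
    close($fh)
        or die "pdftotext failed on $pdf_path (exit status " . ($? >> 8) . ")\n";
    return $text;
}

# Only run when a file name is given on the command line.
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print pdftotext_extract($ARGV[0]);
}
```

Running plain `pdftotext foo.pdf` and then opening foo.txt works just as well; the pipe only saves the temporary file.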