How can I extract text from a PDF file in Perl?
Question
I am trying to extract text from PDF files using Perl. So far I have been running pdftotext.exe from the command line (i.e., through Perl's system function) to extract the text, and this method works fine.
The problem is that the PDF files contain symbols such as α, β and other special characters that do not appear in the generated txt file. Also, a few extra spaces are added at random places in the text.
Is there a better and more reliable way to extract text from PDF files, so that the output includes all the symbols (α, β, etc.) and exactly matches the text in the PDF, i.e., without extra spaces?
Recommended answer
You can extract text from a PDF with the CAM::PDF and CAM::PDF::PageText modules from CPAN:
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new($filename);    # $filename: path to the PDF file
my $pageone_tree = $pdf->getPageContentTree(1);    # content tree of page 1
print CAM::PDF::PageText->render($pageone_tree);
CAM::PDF::PageText attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. The module uses a few heuristics to guess what text goes next to what other text, but it may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields, etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
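The snippet above only handles page 1. As a rough sketch, assuming the CAM::PDF methods numPages and getPageText behave as documented, the same idea can be extended to every page. The helper name pdf_pages_text is made up for this illustration and is not part of CAM::PDF. The same caveats about layout heuristics apply, so this will not necessarily fix the missing symbols or the extra spaces.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use CAM::PDF;

# Hypothetical helper (not part of CAM::PDF): returns one string per page.
sub pdf_pages_text {
    my ($filename) = @_;

    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename: $CAM::PDF::errstr\n";

    my @pages;
    for my $pagenum (1 .. $pdf->numPages()) {
        # getPageText may return undef when a page cannot be parsed,
        # so fall back to an empty string for that page.
        my $text = $pdf->getPageText($pagenum);
        push @pages, defined $text ? $text : '';
    }
    return @pages;
}

# Only run when a file name is given on the command line.
if (@ARGV) {
    my $pagenum = 0;
    for my $text (pdf_pages_text($ARGV[0])) {
        $pagenum++;
        print "--- page $pagenum ---\n$text\n";
    }
}
```

Saved as, say, dump_pdf.pl (an arbitrary name), it could be run as perl dump_pdf.pl file.pdf > file.txt.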