How can I extract text from a PDF file in Perl?

Question

I am trying to extract text from PDF files using Perl. So far I have been running pdftotext.exe from the command line (i.e., through Perl's system function) to extract the text, and this method works fine.

The problem is that the PDF files contain symbols such as α and β and other special characters that do not appear in the generated txt file. In addition, a few extra spaces are added at random places in the text.

Is there a better, more reliable way to extract text from PDF files, so that the output includes all symbols such as α and β and matches the text in the PDF exactly (i.e., without extra spaces)?

Recommended answer

You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From CPAN:

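   use CAM::PDF;
   use CAM::PDF::PageText;
   # Render the text of page 1; $filename holds the path to the PDF.
   my $pdf = CAM::PDF->new($filename);
   my $pageone_tree = $pdf->getPageContentTree(1);
   print CAM::PDF::PageText->render($pageone_tree);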

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields, etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
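
The snippet above handles only page 1. A minimal sketch of extending it to every page might look like the following. It assumes CAM::PDF is installed from CPAN and uses its numPages and getPageContentTree methods. The helper names dump_pdf_text and normalize_spaces are hypothetical and not part of any module. normalize_spaces only collapses runs of spaces and tabs; it cannot repair a stray space inserted inside a word, and nothing here guarantees that symbols such as α and β survive extraction.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of spaces/tabs into a single space
# and strip trailing spaces from each line.
sub normalize_spaces {
    my ($text) = @_;
    $text =~ s/[ \t]+/ /g;
    $text =~ s/ +$//mg;
    return $text;
}

# Hypothetical helper: return the text of every page, joined by newlines.
# CAM::PDF is loaded at run time, so it is only needed when this is called.
sub dump_pdf_text {
    my ($filename) = @_;
    require CAM::PDF;
    require CAM::PDF::PageText;

    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename\n";

    my @pages;
    for my $n (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($n);
        next unless $tree;    # skip pages without a parsable content stream
        push @pages, normalize_spaces(CAM::PDF::PageText->render($tree));
    }
    return join "\n", @pages;
}

# Usage: perl dump_text.pl file.pdf
print dump_pdf_text($ARGV[0]) if @ARGV;
```

Because the page text is still assembled by the module's heuristics, the caveats quoted above about subscripts, non-horizontal text, and font changes apply to every page.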
