How can I extract text from a PDF file in Perl?

Views: 72 Posted: 2020/5/25 3:49:33 Tags: perl pdf text extract


Question

I am trying to extract text from PDF files using Perl. So far I have been running pdftotext.exe from the command line (i.e. via Perl's system function), and this method works fine.

The problem is that the PDF files contain symbols such as α and β and other special characters, which do not appear in the generated txt file. A few extra spaces are also added at random places in the text.

Is there a better, more reliable way to extract text from PDF files, so that the output includes all symbols such as α and β and exactly matches the text in the PDF (i.e. without extra spaces)?

Recommended answer

These modules can extract text from a PDF:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From CPAN:

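   use CAM::PDF;
   use CAM::PDF::PageText;

   my $pdf = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";   # $filename: path to the PDF
   my $pageone_tree = $pdf->getPageContentTree(1);                    # content tree of page 1
   print CAM::PDF::PageText->render($pageone_tree);                   # text of page 1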

This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields, etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
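
The snippet quoted from CPAN renders only page one. As a rough sketch, assuming CAM::PDF and CAM::PDF::PageText are installed, the same calls can be looped over every page with numPages. The sub name pdf_to_text and the command-line handling are illustrative choices, not part of either module. The heuristic caveats quoted above still apply, and nothing here guarantees that symbols such as α and β survive extraction.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Hypothetical helper (not part of CAM::PDF): return the text of all pages
# of a PDF as one string, with pages separated by a newline.
sub pdf_to_text {
    my ($filename) = @_;

    # new() returns a false value on failure and sets $CAM::PDF::errstr.
    my $pdf = CAM::PDF->new($filename)
        or die "Cannot open $filename: $CAM::PDF::errstr\n";

    my @pages;
    for my $pagenum (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($pagenum);
        next if !$tree;    # skip pages whose content could not be parsed
        push @pages, CAM::PDF::PageText->render($tree);
    }
    return join "\n", @pages;
}

# Run as: perl pdf_to_text.pl input.pdf > output.txt
if (@ARGV) {
    print pdf_to_text($ARGV[0]);
}
```

Wrapping the loop in a sub keeps the extraction step separate from output handling, so the returned string can be post-processed (for example, to tidy whitespace) before it is written out.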
