Script to search for text in a PDF

Problem description

On Mac OS X, I would like to write a script, in either Python or Tcl, to search for text within a PDF file and extract the relevant parts. I would appreciate any help.

I am writing scripts that look inside a PDF to determine whether it is a bill, which company it is from, and what period it covers. Based on this information, I rename the PDF and move it to an appropriate directory. For example, a file such as Statement_03948293929384.pdf might become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
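
The rename step described above could be sketched roughly as follows, assuming the PDF's text has already been extracted into a string. The keyword table, the folder names, and the `MM/DD/YYYY` date format are hypothetical placeholders rather than details from the question; real statements would need their own rules.

```python
import re
from datetime import datetime
from pathlib import Path

# Hypothetical rules: keyword found in the text -> (bill label, target folder).
BILL_RULES = {
    "water": ("Water Bill", "Utilities"),
    "electric": ("Electric Bill", "Utilities"),
}

# Assumes statement dates look like 07/15/2012; adjust for real documents.
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def plan_rename(text):
    """Return a relative destination Path for a bill, or None if unrecognized."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        date = datetime(int(year), int(month), int(day))
    except ValueError:
        return None  # digits matched the pattern but are not a real date
    lowered = text.lower()
    for keyword, (label, folder) in BILL_RULES.items():
        if keyword in lowered:
            return Path(folder) / f"{date:%Y-%m-%d} {label}.pdf"
    return None
```

The function only plans the destination; the actual move could then be done with `shutil.move(original_path, base_dir / plan_rename(text))`, where `base_dir` is whatever directory holds the sorted folders.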

  • I have searched for PDF-to-plain-text tools, but have not found anything yet.
  • I have looked at the Tcl wiki and found an example, but could not get it to work (I searched for text that is in the PDF, but it was not found).
  • I am looking into pdf-parser.py by Didier Stevens.
  • I have heard of a Python package called pyPdf and will look at it next.

I have found a command-line tool called pdftotext, written by Glyph & Cog, LLC and built and packaged by Carsten Bluem. This tool is straightforward and it solves my problem. I am still looking for tools that can search a PDF directly, without having to convert it to a text file first.
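
As a minimal sketch of driving pdftotext from Python: passing `-` as the output file makes pdftotext write the text to standard output, so the script can search it without leaving a text file on disk (the PDF is still converted, just in memory). This assumes `pdftotext` is installed and on the `PATH`; the helper names are illustrative only.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext invocation; '-' as output sends text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def pdf_contains(pdf_path, phrase):
    """Case-insensitive check for a phrase in the PDF's extracted text."""
    return phrase.lower() in pdf_to_text(pdf_path).lower()
```

For example, `pdf_contains("Statement_03948293929384.pdf", "water")` would report whether the word appears anywhere in that statement. Scanned PDFs that contain only images will yield little or no text this way.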

Recommended answer

I have successfully used PyODConverter to convert to and from PDF (there is also a more powerful Java version). Once the PDF has been converted to text, searching it should be trivial. I also believe iText should be capable of doing similar things, but I haven't tested it.
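
The search step on already-converted text might look something like this sketch, which returns each matching line together with a little surrounding context so the relevant part can be extracted. The function name and the `context` parameter are illustrative, not part of any of the tools mentioned.

```python
import re


def find_matches(text, pattern, context=1):
    """Return (line_number, snippet) pairs for lines matching the regex.

    Each snippet includes `context` lines before and after the matching line.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    lines = text.splitlines()
    results = []
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            results.append((index + 1, "\n".join(lines[start:end])))
    return results
```

Calling `find_matches(text, r"amount due")` on a statement's text would list every line mentioning the amount due, with one line of context on each side.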
