How can I do a full-text search of PDF files from Perl?


Problem description


I have a bunch of PDF files and my Perl program needs to do a full-text search of them to return which ones contain a specific string. To date I have been using this:

my @search_results = `grep -i -l \"$string\" *.pdf`;

where $string is the text to look for. However, this fails for most PDFs, because the text inside a PDF is usually compressed or encoded rather than stored as plain ASCII.

What can I do that's easiest?

Clarification: There are about 300 PDFs whose names I do not know in advance. PDF::Core is probably overkill. I am trying to get pdftotext and grep to play nicely together, but since I don't know the names of the PDFs, I haven't found the right syntax yet.

Final solution using Adam Bellaire's suggestion below:

@search_results = `for i in \$( ls ); do pdftotext \$i - | grep --label="\$i" -i -l "$search_string"; done`;

Solution

The PerlMonks thread here talks about this problem.

It seems that for your situation, it might be simplest to get pdftotext (the command-line tool); then you can do something like:

my @search_results = `pdftotext myfile.pdf - | grep -i -l \"$string\"`;
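The one-liner above handles a single, named file, and `grep -l` reading from a pipe reports `(standard input)` rather than the file name. As a rough sketch of how the same idea might be extended to every PDF in the current directory, the loop can be done on the Perl side instead of in the shell. The helper names `pdf_text` and `files_containing` are made up for this illustration, and the sketch assumes pdftotext is installed and on the PATH. Note that it matches the search string literally (via `\Q...\E`), whereas grep would treat it as a regular expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of one PDF by running pdftotext.
# The "-" argument sends the extracted text to stdout. The list form of
# open() bypasses the shell, so file names containing spaces are safe.
sub pdf_text {
    my ($file) = @_;
    open(my $fh, '-|', 'pdftotext', $file, '-')
        or die "Cannot run pdftotext on $file: $!";
    local $/;                      # slurp mode: read all output at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

# Hypothetical helper: return the files whose extracted text contains
# $string (case-insensitive, literal match). $extract is a code ref that
# maps a file name to its text, which keeps the matching logic testable
# without real PDFs.
sub files_containing {
    my ($string, $extract, @files) = @_;
    return grep { $extract->($_) =~ /\Q$string\E/i } @files;
}

# Run a search only when invoked directly with a search string argument.
if (!caller && @ARGV) {
    my $string = shift @ARGV;
    my @search_results = files_containing($string, \&pdf_text, glob('*.pdf'));
    print "$_\n" for @search_results;
}
```

Saved as, say, `pdfsearch.pl`, this could be run as `perl pdfsearch.pl "some text"` from the directory holding the PDFs. Using `glob('*.pdf')` rather than parsing `ls` output means the file names never need to be known in advance and non-PDF files are skipped.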
