How to search keywords in 400+ PDF files?


Question

I have 400 or more PDF files that together form a single text. It's like a book split into separate files, one page each. I need to be able to search the whole text for some keywords programmatically.

So my first question is: is it better to search page by page, or to join all the PDFs into one big file first and then perform the search?

The second one is: what is the best way to do it? Is there already a good program or library for this?

By the way, I'm only using PHP and Python.

Recommended answer

Use pyPdf to extract the text of each file, for example:

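import glob

# pyPdf is no longer maintained and does not run on Python 3. This version
# uses its successor, pypdf (pip install pypdf); that swap is an assumption
# beyond the original answer.
from pypdf import PdfReader

def getPDFContent(path):
    # Load the PDF
    reader = PdfReader(path)
    # Extract the text of every page (a page with no text yields "")
    pages = [page.extract_text() or "" for page in reader.pages]
    # Replace non-breaking spaces and collapse runs of whitespace
    return " ".join("\n".join(pages).replace("\xa0", " ").split())

keyword = "example"                    # placeholder search term
filelist = sorted(glob.glob("*.pdf"))  # placeholder: PDFs in the current directory

for f in filelist:
    # Print each file name and whether its text contains the keyword
    print(f, keyword in getPDFContent(f))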

Searching the files one by one is faster and much simpler than merging them first: you just loop over all the files and run the code above on each one.
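As a rough sketch of that one-by-one loop, extended to several keywords, something like the following could work. It is not part of the original answer. The function names `extract_pdf_text`, `normalize`, and `search_files` are hypothetical, and the extractor assumes the third-party pypdf package (the successor of pyPdf) is installed. Each file is read only once, however many keywords are checked.

```python
from typing import Callable, Dict, Iterable, List


def extract_pdf_text(path: str) -> str:
    """Return the text of one PDF (assumes the third-party pypdf package)."""
    # Imported lazily so the search logic below has no hard dependency on pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def normalize(text: str) -> str:
    """Lower-case the text and collapse whitespace so line breaks do not hide a match."""
    return " ".join(text.replace("\xa0", " ").split()).lower()


def search_files(
    paths: Iterable[str],
    keywords: List[str],
    extract: Callable[[str], str] = extract_pdf_text,
) -> Dict[str, List[str]]:
    """Map each keyword to the files whose text contains it, reading each file once."""
    hits: Dict[str, List[str]] = {kw: [] for kw in keywords}
    for path in paths:
        text = normalize(extract(path))
        for kw in keywords:
            if normalize(kw) in text:
                hits[kw].append(path)
    return hits
```

Calling `search_files(sorted(glob.glob("*.pdf")), ["first keyword", "second keyword"])` (with `import glob`) would then return, for each keyword, the list of files that mention it. One limitation of searching file by file: a phrase that starts on one page and ends on the next will not be found, since each file is matched on its own.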
