Finding on which page a search string is located in a PDF document using Python

Question

Which Python packages can I use to find out on which page a specific "search string" is located?

I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?

More precisely: I have several PDF documents and I would like to extract the pages that lie between a string "Begin" and a string "End".

Recommended answer

I finally figured out that pyPDF can help. I am posting my solution in case it helps somebody else.

(1) A function to locate the string

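def fnPDF_FindText(xFile, xString):
    # xFile   : path of the PDF file in which to look
    # xString : the string (used as a regular expression) to look for
    # Returns the 0-based index of the first matching page, or -1 if none matches
    import pyPdf, re
    PageFound = -1
    # open() replaces the Python 2-only file() builtin
    pdfDoc = pyPdf.PdfFileReader(open(xFile, "rb"))
    for i in range(0, pdfDoc.getNumPages()):
        content = pdfDoc.getPage(i).extractText() + "\n"
        # Drop non-ASCII characters and lower-case the page text
        content1 = content.encode('ascii', 'ignore').lower()
        # The page text is lower-cased, so match case-insensitively
        ResSearch = re.search(xString, content1, re.IGNORECASE)
        if ResSearch is not None:
            PageFound = i
            break
    return PageFound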
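
For readers who are not on pyPdf, the same search can be sketched against the newer pypdf package. This is only a minimal sketch, assuming pypdf is installed; the names find_text_page and find_text_in_pdf are hypothetical and do not come from the answer. The search helper only needs objects that offer an extract_text() method, so it can be tried without a real PDF.

```python
import re


def find_text_page(pages, pattern):
    """Return the 0-based index of the first page whose text matches
    `pattern` (a regular expression, matched case-insensitively),
    or -1 if no page matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    for index, page in enumerate(pages):
        # Treat a page that yields no text (e.g. a scanned image) as empty
        text = page.extract_text() or ""
        if regex.search(text):
            return index
    return -1


def find_text_in_pdf(path, pattern):
    """Open the PDF at `path` with pypdf and search its pages."""
    # Imported lazily, as in the answer; assumes `pip install pypdf`
    from pypdf import PdfReader
    reader = PdfReader(path)
    return find_text_page(reader.pages, pattern)
```

As in the answer's function, a result of -1 means the string was not found, and page indexes start at 0.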

(2) A function to extract the pages of interest

def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copy pages xPageStart .. xPageEnd - 1 (0-based) into a new PDF file
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    # open() replaces the Python 2-only file() builtin
    pdfOne = PdfFileReader(open(xFileNameOriginal, "rb"))
    for i in range(xPageStart, xPageEnd):
        output.addPage(pdfOne.getPage(i))
    # Write the output once, after all pages have been added
    outputStream = open(xFileNameOutput, "wb")
    output.write(outputStream)
    outputStream.close()
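
To connect the two functions above to the original goal (the pages between "Begin" and "End"), the page range has to be worked out from both markers. The following is a rough sketch of that step only, assuming the text of each page has already been extracted into a list of strings; pages_between is a hypothetical helper, not part of the answer, and it matches the markers literally and case-sensitively.

```python
def pages_between(page_texts, begin="Begin", end="End"):
    """Return (first, last) as 0-based, inclusive page indexes, where
    `first` is the first page containing `begin` and `last` is the first
    page at or after it containing `end`. Return None if either marker
    is missing."""
    first = None
    for index, text in enumerate(page_texts):
        if first is None and begin in text:
            first = index
        # The end marker may sit on the same page as the begin marker
        if first is not None and end in text:
            return first, index
    return None


# Hypothetical page texts standing in for the extracted text of each page
texts = ["cover", "... Begin ...", "middle", "... End ...", "appendix"]
print(pages_between(texts))
```

Because fnPDF_ExtractPages uses range(xPageStart, xPageEnd), which excludes xPageEnd, an inclusive result (first, last) would be passed as xPageStart=first and xPageEnd=last + 1.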

I hope this will be helpful to somebody else.
