Finding on which page a search string is located in a PDF document using Python
Question
Which Python packages can I use to find out on which page a specific "search string" is located?
I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems like overkill for such a simple task. Any advice?
More precisely: I have several PDF documents, and I would like to extract the pages that lie between the string "Begin" and the string "End".
Recommended answer
I finally figured out that pyPDF can help. I am posting my solution in case it helps somebody else.
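(1) A function to locate the string
def fnPDF_FindText(xFile, xString):
    # xFile   : the PDF file in which to look
    # xString : the string (treated as a regular expression) to look for
    # Returns the zero-based index of the first matching page, or -1 if none.
    import pyPdf, re
    PageFound = -1
    with open(xFile, "rb") as pdfFile:
        pdfDoc = pyPdf.PdfFileReader(pdfFile)
        for i in range(0, pdfDoc.getNumPages()):
            content = pdfDoc.getPage(i).extractText() + "\n"
            # Drop non-ASCII characters and lower-case the page text
            content1 = content.encode('ascii', 'ignore').lower()
            # The page text is lower-cased, so match case-insensitively;
            # otherwise a pattern such as "Begin" could never match.
            ResSearch = re.search(xString, content1, re.IGNORECASE)
            if ResSearch is not None:
                PageFound = i
                break
    return PageFound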
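The search step itself only needs the text of each page, so it can be separated from the PDF library. The following is a minimal sketch under that assumption, not the answer's own code: `iter_page_texts` and `find_text_page` are hypothetical helper names, and `iter_page_texts` assumes the newer third-party `pypdf` package (not mentioned in the original answer) is installed.

```python
import re


def iter_page_texts(pdf_path):
    """Yield the extracted text of each page, in page order.

    Assumption: the third-party `pypdf` package is installed. It is imported
    inside the function so that find_text_page() below works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # Fall back to an empty string if no text could be extracted
        yield page.extract_text() or ""


def find_text_page(page_texts, pattern):
    """Return the zero-based index of the first page whose text matches
    the regular expression `pattern` (case-insensitive), or -1 if none does."""
    regex = re.compile(pattern, re.IGNORECASE)
    for index, text in enumerate(page_texts):
        if regex.search(text):
            return index
    return -1
```

With a real file this might be called as `find_text_page(iter_page_texts("report.pdf"), "Begin")`, where `report.pdf` is a placeholder file name. Because the pattern is a regular expression, a literal search string containing characters such as `.` or `(` should be passed through `re.escape` first.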
(2) A function to extract the pages of interest
def fnPDF_ExtractPages(xFileNameOriginal, xFileNameOutput, xPageStart, xPageEnd):
    # Copies pages xPageStart .. xPageEnd - 1 (zero-based) into a new PDF file
    from pyPdf import PdfFileReader, PdfFileWriter
    output = PdfFileWriter()
    # Keep the input file open until the output has been written
    with open(xFileNameOriginal, "rb") as inputStream:
        pdfOne = PdfFileReader(inputStream)
        for i in range(xPageStart, xPageEnd):
            output.addPage(pdfOne.getPage(i))
        with open(xFileNameOutput, "wb") as outputStream:
            output.write(outputStream)
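To get from the two functions to the original goal (the pages between "Begin" and "End"), the start and end pages have to be combined into one range. The sketch below shows one way this could be done on a list of already extracted page texts. `find_page_range` is a hypothetical helper, and treating the "End" page as part of the result is an assumption, since the question does not say whether the marker pages themselves should be included.

```python
import re


def find_page_range(page_texts, begin_pattern="Begin", end_pattern="End"):
    """Return (start, stop) so that range(start, stop) covers the first page
    matching begin_pattern through the first page at or after it that matches
    end_pattern, inclusive. Return None if either marker is not found.

    Matching is case-sensitive here, so "End" does not match "appendix".
    """
    begin_re = re.compile(begin_pattern)
    end_re = re.compile(end_pattern)
    start = None
    for index, text in enumerate(page_texts):
        if start is None and begin_re.search(text):
            start = index
        if start is not None and end_re.search(text):
            # stop is exclusive, so add 1 to include the "End" page
            return start, index + 1
    return None
```

Since `fnPDF_ExtractPages` copies `range(xPageStart, xPageEnd)`, the returned pair can be passed to it directly as `xPageStart` and `xPageEnd`. To exclude the marker pages instead, `start + 1` and `stop - 1` could be passed.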
I hope this will be helpful to somebody else.