Finding on which page a search string is located in a PDF document using Python
Question
Which Python packages can I use to find out on which page a specific "search string" is located?
I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems to be overkill for such a simple task. Any advice?
More precisely: I have several PDF documents, and I would like to extract the pages that lie between the string "Begin" and the string "End".
Recommended answer
Finding on which page a search string is located in a PDF document using Python
PyPDF2
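# import packages
import PyPDF2
import re

# open the pdf file
# (PdfFileReader, getNumPages, getPage and extractText are the legacy PyPDF2 names;
#  PyPDF2 3.x replaced them with PdfReader, len(reader.pages), reader.pages[i] and extract_text())
reader = PyPDF2.PdfFileReader(r"source_file_path")

# get number of pages
num_pages = reader.getNumPages()

# define keyterms (re.search interprets this as a regular expression)
search_term = "P4F-21B"

# extract text and do the search
for i in range(num_pages):
    page = reader.getPage(i)
    text = page.extractText()
    match = re.search(search_term, text)
    if match is not None:
        print(match)
        print("Page Number" + str(i + 1))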
Output:
<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1
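Note that the search term is handed to `re.search` as a regular expression, so characters such as `.` or `(` in it would be read as pattern syntax. The following is a minimal sketch of a literal, optionally case-insensitive search that collects every matching page number. The function name `find_pages`, its parameters, and the sample texts are hypothetical; the list of page texts is assumed to come from an extraction loop like the one above.

```python
import re


def find_pages(page_texts, needle, ignore_case=False):
    """Return the 1-based numbers of pages whose text contains `needle` literally."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes every character of the needle match itself.
    pattern = re.compile(re.escape(needle), flags)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if pattern.search(text or "")  # treat a missing page text as empty
    ]


# Hypothetical page texts standing in for the extracted text of each page.
sample = ["intro", "part P4F-21B here", "nothing", "p4f-21b again"]
print(find_pages(sample, "P4F-21B"))                    # [2]
print(find_pages(sample, "P4F-21B", ignore_case=True))  # [2, 4]
```

Returning a list rather than printing inside the loop makes the result easy to reuse, for example to decide which pages to extract.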
PyMuPDF
import fitz  # PyMuPDF
import re

# load document
doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")

# define keyterms
search_term = "P4F-21B"

# get text, search for string and print count on page.
for page in doc:
    text = page.get_text()
    count = len(re.findall(search_term, text))
    if count > 0:
        print(f'count on page {page.number + 1} is: {count}')
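Neither snippet covers the second part of the question, extracting the pages between "Begin" and "End". One possible sketch with PyMuPDF is shown below. The helper names `pages_between` and `extract_between` are hypothetical, and so are the assumptions built into them: both markers are matched as plain substrings (so "End" also matches "Endless"), the range starts at the first page containing the begin marker, and it stops at the first page at or after that one containing the end marker.

```python
def pages_between(page_texts, begin="Begin", end="End"):
    """Return (first, last) 0-based page indices, inclusive, or None if a marker is missing."""
    first = next((i for i, text in enumerate(page_texts) if begin in text), None)
    if first is None:
        return None
    last = next(
        (i for i in range(first, len(page_texts)) if end in page_texts[i]), None
    )
    if last is None:
        return None
    return first, last


def extract_between(src_path, dst_path, begin="Begin", end="End"):
    """Copy the pages from the begin marker to the end marker into a new PDF."""
    import fitz  # PyMuPDF; imported here so pages_between works without it

    doc = fitz.open(src_path)
    span = pages_between([page.get_text() for page in doc], begin, end)
    if span is None:
        doc.close()
        return False
    out = fitz.open()  # new, empty PDF
    out.insert_pdf(doc, from_page=span[0], to_page=span[1])
    out.save(dst_path)
    out.close()
    doc.close()
    return True


# Hypothetical page texts to illustrate the index logic without a PDF file.
texts = ["cover", "Begin of section", "body", "the End", "appendix"]
print(pages_between(texts))  # (1, 3)
```

Keeping the index logic separate from the PDF handling means it can be checked on plain strings, and the same function would work with page texts obtained from PyPDF2.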