Finding which page a search string is located on in a PDF document using Python

Question

Which Python packages can I use to find out on which page a specific "search string" is located?

I looked into several Python PDF packages but couldn't figure out which one I should use. PyPDF does not seem to have this functionality, and PDFMiner seems like overkill for such a simple task. Any advice?

More precisely: I have several PDF documents, and I would like to extract the pages that lie between a string "Begin" and a string "End".

Recommended answer

Finding which page a search string is on in a PDF document using Python

PyPDF2

    # import packages
    import re
    from PyPDF2 import PdfReader

    # open the pdf file (PyPDF2 3.x API; the older PdfFileReader,
    # getNumPages, getPage and extractText names no longer work there)
    reader = PdfReader(r"source_file_path")

    # get number of pages
    num_pages = len(reader.pages)

    # define keyterms
    search_string = "P4F-21B"

    # extract text and do the search
    for i in range(num_pages):
        page = reader.pages[i]
        text = page.extract_text()
        # re.escape makes the search string match literally
        match = re.search(re.escape(search_string), text)
        if match is not None:
            print(match)
            print("Page Number" + str(i + 1))

Output:

<re.Match object; span=(57, 64), match='P4F-21B'>
Page Number1
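
Text extracted from a PDF often contains line breaks or irregular spacing inside a phrase, so an exact match can miss a multi-word search string. Below is a minimal sketch of a more forgiving search. It assumes the page texts have already been extracted into a list of strings; `normalize` and `pages_containing` are hypothetical helper names, not part of PyPDF2.

```python
import re


def normalize(text):
    """Collapse runs of whitespace to one space and ignore case."""
    # `text or ""` tolerates pages that yielded no text at all.
    return re.sub(r"\s+", " ", text or "").strip().casefold()


def pages_containing(page_texts, needle):
    """Return 1-based page numbers whose text contains `needle`."""
    target = normalize(needle)
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if target in normalize(text)
    ]


# Hand-written page texts standing in for extracted text (illustrative only).
sample_pages = ["Intro\npage", "Part  number\nP4F-21B listed here", None, "p4f-21b again"]
print(pages_containing(sample_pages, "P4F-21B"))  # [2, 4]
```

Whether case-insensitive matching is wanted depends on the documents; dropping `.casefold()` restores a case-sensitive search.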

PyMuPDF

    import fitz  # PyMuPDF
    import re

    # load document
    doc = fitz.open(r"C:\Users\shraddha.shetty\Desktop\OCR-pages-deleted.pdf")

    # define keyterms
    search_string = "P4F-21B"

    # get text, search for string and print count on page
    for page in doc:
        text = page.get_text()
        # re.escape makes the search string match literally
        count = len(re.findall(re.escape(search_string), text))
        if count > 0:
            print(f'count on page {page.number + 1} is: {count}')
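
The question also asks for the pages between a string "Begin" and a string "End", which the snippets above do not cover. A rough sketch of one way to do it follows. It assumes PyPDF2 2.x or later (`PdfReader`, `PdfWriter`), that both markers appear as plain extractable text, and that the first page containing each marker is the one wanted. `find_page_range` and `extract_between` are hypothetical names, and the paths passed in are placeholders.

```python
def find_page_range(page_texts, start_marker="Begin", end_marker="End"):
    """Return 0-based (first, last) page indices, inclusive, or None.

    The range starts at the first page containing start_marker and ends at
    the first page at or after it that contains end_marker.
    """
    first = next((i for i, t in enumerate(page_texts) if start_marker in t), None)
    if first is None:
        return None
    last = next(
        (i for i in range(first, len(page_texts)) if end_marker in page_texts[i]),
        None,
    )
    return None if last is None else (first, last)


def extract_between(src_path, dst_path, start_marker="Begin", end_marker="End"):
    """Copy the pages from the start marker to the end marker into a new PDF."""
    # Imported here so find_page_range can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    texts = [page.extract_text() or "" for page in reader.pages]
    page_range = find_page_range(texts, start_marker, end_marker)
    if page_range is None:
        return None  # one of the markers was not found
    first, last = page_range
    writer = PdfWriter()
    for index in range(first, last + 1):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
    return first + 1, last + 1  # 1-based page numbers that were copied


# Demo on hand-written page texts (no PDF needed).
print(find_page_range(["cover", "Begin here", "body", "The End", "index"]))  # (1, 3)
```

The match is a plain, case-sensitive substring test, so "Begin" also matches "Beginning" and "End" matches "Ending"; a stricter pattern such as a word-boundary regular expression may be needed for real documents.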
