How to extract only specific text from a PDF file using Python

Problem description

How can I extract only some specific text from PDF files using Python and store the output in particular columns of an Excel file?

Here is the sample input PDF file (File.pdf)

Link to the full PDF file File.pdf

We need to extract the values of Invoice Number, Due Date and Total Due from the whole PDF file.

Script I have used so far:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('file.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

But this prints all of the text; I am not getting the specific values from the PDF file.

Solution

If you want to find the data your way (with pdfminer), you can search the extracted text for a pattern, as in the following script. The only new part is the regex at the end, which is based on your given data:

from io import StringIO
import re

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('testfile.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

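# Look for five consecutive lines: invoice number, order number,
# invoice date, due date and total due.
finding = re.search(r"INV-\d+\n\d+\n.+\n.+\n\$\d+\.\d+", output_string.getvalue())

if finding is None:
    raise ValueError("invoice pattern not found in the extracted text")

# The third line (invoice date) is not needed, so it is discarded.
invoice_no, order_no, _, due_date, total_due = finding.group(0).split("\n")

print(invoice_no, order_no, due_date, total_due)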
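
To see what the pattern matches without a PDF at hand, here is a minimal sketch that runs the same regular expression on a hard-coded string. The sample text and its values are assumptions about what pdfminer might return for an invoice like this one (the values on consecutive lines); the real output for your file may differ.

```python
import re

# Hypothetical stand-in for output_string.getvalue(); the real text
# depends on the layout of the PDF.
sample_text = (
    "Invoice Number\nOrder Number\nInvoice Date\nDue Date\nTotal Due\n\n"
    "INV-3337\n12345\nJanuary 25, 2016\nJanuary 31, 2016\n$93.50\n"
)

# Same pattern as in the answer: five values, one per line.
pattern = r"INV-\d+\n\d+\n.+\n.+\n\$\d+\.\d+"
finding = re.search(pattern, sample_text)

if finding is None:
    print("pattern not found")
else:
    # Split the matched block back into its five lines.
    invoice_no, order_no, _, due_date, total_due = finding.group(0).split("\n")
    print(invoice_no, due_date, total_due)
```

Because the pattern relies on the values appearing in a fixed order, it stops matching as soon as the extracted text puts them in a different order or adds a line in between.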

If you want to store the data in Excel, you may have to be more specific (or open a new question), or look at these pages:

Writing to an Excel spreadsheet

https://www.geeksforgeeks.org/writing-excel-sheet-using-python/

https://xlsxwriter.readthedocs.io/
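
As a rough sketch of the Excel step, the extracted values could be written one invoice per row with the XlsxWriter package linked above. The function names, the column headers, the sample values and the output path are all assumptions made for illustration; XlsxWriter is a third-party package and has to be installed separately.

```python
def build_rows(invoices):
    """Turn (invoice_no, due_date, total_due) tuples into sheet rows with a header."""
    header = ["Invoice Number", "Due Date", "Total Due"]
    return [header] + [list(invoice) for invoice in invoices]


def write_rows_to_xlsx(rows, path):
    """Write each row to its own line of the sheet, one value per column."""
    import xlsxwriter  # third-party: pip install XlsxWriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    for row_index, row in enumerate(rows):
        worksheet.write_row(row_index, 0, row)  # start at column A
    workbook.close()


# Illustrative values only; in practice they come from the PDF extraction.
rows = build_rows([("INV-3337", "January 31, 2016", "$93.50")])
# write_rows_to_xlsx(rows, "invoices.xlsx")  # hypothetical output path
```

Keeping the row building separate from the file writing makes it easy to check the column order before anything is written to disk.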

PS: the other answer looks like a good solution; you only have to filter the data.

EDIT: Second solution. Here I use another package, PyPDF2, because it returns the data in a different order (maybe this is possible with PDFMiner, too). If the text before each value is always the same, you can find the data like this:

import re
import PyPDF2

def parse_pdf() -> list:
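    with open("testfile.pdf", "rb") as file:
        # PdfReader, .pages and .extract_text() replace the older
        # PdfFileReader, getPage() and extractText() names, which
        # PyPDF2 3.0 no longer supports.
        fr = PyPDF2.PdfReader(file)
        data = fr.pages[0].extract_text()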

    regex_invoice_no = re.compile(r"Invoice Number\s*(INV-\d+)")
    regex_order_no = re.compile(r"Order Number(\d+)")
    regex_invoice_date = re.compile(r"Invoice Date(\S+ \d{1,2}, \d{4})")
    regex_due_date = re.compile(r"Due Date(\S+ \d{1,2}, \d{4})")
    regex_total_due = re.compile(r"Total Due(\$\d+\.\d{1,2})")

    invoice_no = re.search(regex_invoice_no, data).group(1)
    order_no = re.search(regex_order_no, data).group(1)
    invoice_date = re.search(regex_invoice_date, data).group(1)
    due_date = re.search(regex_due_date, data).group(1)
    total_due = re.search(regex_total_due, data).group(1)

    return [invoice_no, due_date, total_due]


if __name__ == '__main__':
    print(parse_pdf())

You may have to change the regexes, because they are based only on the given example. re.search returns None when a pattern is not found, so calling .group(1) on the result then raises an AttributeError; you therefore have to handle each regex separately, for example with try/except ;)
If this does not answer your question, you have to provide more information or example PDFs.
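
The per-regex handling mentioned above can be sketched as follows. Instead of try/except, this version checks each match for None, which has the same effect. The function name, the dictionary layout and the sample text are assumptions, and `\s*` has been added after each label so that optional whitespace is tolerated.

```python
import re

# Patterns taken from the PyPDF2 example; the added \s* is an assumption
# to tolerate whitespace between a label and its value.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice Number\s*(INV-\d+)"),
    "due_date": re.compile(r"Due Date\s*(\S+ \d{1,2}, \d{4})"),
    "total_due": re.compile(r"Total Due\s*(\$\d+\.\d{1,2})"),
}


def extract_fields(text):
    """Return a dict of field name -> value, with None for fields not found."""
    result = {}
    for name, regex in PATTERNS.items():
        match = regex.search(text)
        result[name] = match.group(1) if match else None
    return result


# Hypothetical extracted text in which "Total Due" is missing.
sample = "Invoice NumberINV-3337Order Number12345Due DateJanuary 31, 2016"
fields = extract_fields(sample)
print(fields)
```

A missing field then shows up as None in the result instead of stopping the whole script, so the remaining values can still be written out.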
