How to use pdfminer.six with Python 3?


Question

I want to use pdfminer.six, a tool that can be used with Python 3 to extract information from PDF documents. The problem is that there is no good documentation and no source code example showing how to use the tool.

I have already tried some code from Stack Overflow, but it didn't work. Below is my code.

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

I would like some code examples showing how to use this tool to get data from PDFs.

Recommended answer

Install pdfminer.six or pdfminer3 (https://github.com/gwk/pdfminer3/). To install the latter, run: pip install pdfminer3. I switched to pdfminer3 when I upgraded from Python 3.6 to 3.7, and I use it on Ubuntu and macOS with Python 3.7.3.

pdfminer3 comes with two handy tools: pdf2txt.py and dumppdf.py. Examine their source; it is fairly small and easy to understand.

The following is a working example (once the location of the PDF file is added):

import io

from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

# The resource manager holds resources (such as fonts) shared across pages.
resource_manager = PDFResourceManager()
# In-memory text buffer that receives the extracted text.
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('/path/to/file.pdf', 'rb') as fh:
    # Interpret each page; the converter writes its text into the buffer.
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)
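
For pdfminer.six itself (the package the question asks about), the classes missing from the question's snippet, PDFResourceManager and PDFPageInterpreter, are imported from pdfminer.pdfinterp. If only plain text is needed, a shorter route is the high-level helper extract_text in pdfminer.high_level. The following is a minimal sketch, assuming a reasonably recent pdfminer.six release is installed (pip install pdfminer.six); the wrapper name pdf_to_text and its file check are illustrative and not part of the library.

```python
from pathlib import Path


def pdf_to_text(path, page_numbers=None, password=""):
    """Return the text of a PDF using pdfminer.six's high-level API.

    pdf_to_text is a hypothetical helper name used for illustration only.
    page_numbers is an optional collection of zero-based page indexes;
    None means every page.
    """
    pdf_path = Path(path)
    if not pdf_path.is_file():
        # Fail early with a clear message before touching pdfminer.
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")

    # Imported here so the file check above runs even when pdfminer.six
    # is not installed; a top-level import works just as well.
    from pdfminer.high_level import extract_text

    return extract_text(str(pdf_path), password=password, page_numbers=page_numbers)


# Example usage (replace the path with a real file):
# print(pdf_to_text('/path/to/file.pdf'))
# print(pdf_to_text('/path/to/file.pdf', page_numbers=[0]))  # first page only
```

The lower-level approach shown above remains useful when per-page control or layout objects are needed; the helper here only covers the plain-text case.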
