Python: read part of a PDF page

Question

I'm trying to read a pdf file where each page is divided into 3x3 blocks of information of the form

A | B | C
D | E | F
G | H | I

Each of the entries is broken into multiple lines. A simplified example of one entry is this card. But then there would be similar entries in the other 8 slots.

I've looked at pdfminer and pypdf2. I haven't found pdfminer overly useful, but pypdf2 has given me something close.

import PyPDF2


def getPDFContent(path):
    content = ""
    # Legacy PyPDF2 API (PdfFileReader, getPage, extractText), as in the original post
    with open(path, "rb") as p:
        pdf = PyPDF2.PdfFileReader(p)
        numPages = pdf.getNumPages()
        for i in range(numPages):
            content += pdf.getPage(i).extractText() + "\n"
    # Normalize non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

However, this only reads the file line by line. I'd like a solution where I can read only a portion of the page so that I could read A, then B, then C, and so on. Also, the answer here works fairly well, but the order of columns routinely gets distorted and I've only gotten it to read line by line.

Recommended answer

I assume the PDF files in question are generated PDFs rather than scanned (as in the example you gave), given that you're using pdfminer and pypdf2. If you know the size of the columns and rows in inches you can use minecart (full disclosure: I wrote minecart). Example code:

import minecart

# minecart units are 1/72 inch, measured from bottom-left of the page
ROW_BORDERS = (
    72 * 7,  # Top row starts 7 inches from the bottom of the page
    72 * 5,  # Top row ends (middle row starts) 5 inches from the bottom
    72 * 3,  # Middle row ends (bottom row starts) 3 inches from the bottom
    72 * 1,  # Bottom row ends 1 inch from the bottom of the page
)  # listed top to bottom so that BOXES comes out in A..I order
COLUMN_BORDERS = (
    72 * 2,  # First col starts 2 inches from the left of the page
    72 * 4,  # Second col starts 4 inches from the left of the page
    72 * 6,  # Third col starts 6 inches from the left of the page
    72 * 8,  # Third col ends 8 inches from the left of the page
)
BOXES = [
    (left, bot, right, top)
    for top, bot in zip(ROW_BORDERS, ROW_BORDERS[1:])
    for left, right in zip(COLUMN_BORDERS, COLUMN_BORDERS[1:])
]

def extract_output(page):
    """
    Reads the text from page and splits it into the 9 cells.

    Returns a list with 9 entries: 

        [A, B, C, D, E, F, G, H, I]

    Each item in the list contains a string with all of the
    text found in the cell.

    """
    res = []
    for box in BOXES:
        strings = list(page.letterings.iter_in_bbox(box))
        # We sort from top-to-bottom and then from left-to-right, based
        # on the strings' top left corner
        strings.sort(key=lambda x: (-x.bbox[3], x.bbox[0]))
        res.append(" ".join(strings).replace(u"\xa0", " ").strip())
    return res

content = []
# Keep the file open while pages are being read, then close it
with open("path/to/pdf-doc.pdf", "rb") as pdf_file:
    doc = minecart.Document(pdf_file)
    for page in doc.iter_pages():
        content.append(extract_output(page))
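
To check the grid geometry without a PDF at hand, the box construction can be tried on its own. The following is a minimal sketch in plain Python, not part of minecart: the helper names grid_boxes, cell_index and reading_order are made up for illustration, and the inch measurements are the same assumed values as in the answer. It builds the nine (left, bottom, right, top) boxes in A..I order, looks up which cell a point falls into, and applies the same top-to-bottom, left-to-right sort key.

```python
POINTS_PER_INCH = 72  # PDF user-space units are 1/72 inch by default


def grid_boxes(row_edges_in, col_edges_in):
    """Build (left, bottom, right, top) boxes in reading order.

    row_edges_in: horizontal cell edges, in inches from the page bottom.
    col_edges_in: vertical cell edges, in inches from the page left.
    """
    # Rows run top to bottom (descending y), columns left to right (ascending x)
    rows = sorted((e * POINTS_PER_INCH for e in row_edges_in), reverse=True)
    cols = sorted(e * POINTS_PER_INCH for e in col_edges_in)
    return [
        (left, bot, right, top)
        for top, bot in zip(rows, rows[1:])
        for left, right in zip(cols, cols[1:])
    ]


def cell_index(boxes, x, y):
    """Return the index of the first box containing point (x, y), or None."""
    for i, (left, bot, right, top) in enumerate(boxes):
        if left <= x < right and bot <= y < top:
            return i
    return None


def reading_order(items):
    """Sort (text, bbox) pairs top-to-bottom, then left-to-right."""
    return sorted(items, key=lambda item: (-item[1][3], item[1][0]))


boxes = grid_boxes([1, 3, 5, 7], [2, 4, 6, 8])
labels = "ABCDEFGHI"
# A point 2.5 in from the left and 6.5 in from the bottom lies in the top-left cell
print(labels[cell_index(boxes, 2.5 * POINTS_PER_INCH, 6.5 * POINTS_PER_INCH)])
```

Sorting the edges inside grid_boxes means the caller can list them in any order, which avoids the easy mistake of pairing a "top" value with a larger "bottom" value.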
