Python: read part of a PDF page
Problem description
I'm trying to read a pdf file where each page is divided into 3x3 blocks of information of the form
A | B | C
D | E | F
G | H | I
Each of the entries is broken into multiple lines. A simplified example of one entry is this card. But then there would be similar entries in the other 8 slots.
I've looked at pdfminer and pypdf2. I haven't found pdfminer overly useful, but pypdf2 has given me something close.
import PyPDF2
from StringIO import StringIO

def getPDFContent(path):
    content = ""
    p = file(path, "rb")
    pdf = PyPDF2.PdfFileReader(p)
    numPages = pdf.getNumPages()
    for i in range(numPages):
        content += pdf.getPage(i).extractText() + "\n"
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
However, this only reads the file line by line. I'd like a solution where I can read only a portion of the page, so that I could read A, then B, then C, and so on. Also, the answer here works fairly well, but the order of columns routinely gets distorted and I've only gotten it to read line by line.
Recommended answer
I assume the PDF files in question are generated PDFs rather than scanned (as in the example you gave), given that you're using pdfminer and pypdf2. If you know the size of the columns and rows in inches, you can use minecart (full disclosure: I wrote minecart). Example code:
import minecart

# minecart units are 1/72 inch, measured from bottom-left of the page
ROW_BORDERS = (
    72 * 7,  # Top row ends 7 inches from the bottom of the page
    72 * 5,  # Top row starts 5 inches from the bottom of the page
    72 * 3,  # Middle row starts 3 inches from the bottom of the page
    72 * 1,  # Bottom row starts 1 inch from the bottom of the page
)  # listed top-down so that BOXES is ordered A, B, C, ..., I
COLUMN_BORDERS = (
    72 * 2,  # First col starts 2 inches from the left of the page
    72 * 4,  # Second col starts 4 inches from the left of the page
    72 * 6,  # Third col starts 6 inches from the left of the page
    72 * 8,  # Third col ends 8 inches from the left of the page
)
# Each box is (left, bottom, right, top), with left < right and bottom < top
BOXES = [
    (left, bot, right, top)
    for top, bot in zip(ROW_BORDERS, ROW_BORDERS[1:])
    for left, right in zip(COLUMN_BORDERS, COLUMN_BORDERS[1:])
]

def extract_output(page):
    """
    Reads the text from page and splits it into the 9 cells.

    Returns a list with 9 entries:

        [A, B, C, D, E, F, G, H, I]

    Each item in the list contains a string with all of the
    text found in the cell.
    """
    res = []
    for box in BOXES:
        strings = list(page.letterings.iter_in_bbox(box))
        # We sort from top-to-bottom and then from left-to-right, based
        # on the strings' top left corner
        strings.sort(key=lambda x: (-x.bbox[3], x.bbox[0]))
        res.append(" ".join(strings).replace(u"\xa0", " ").strip())
    return res

content = []
# Keep the file open while iterating, since pages are read lazily
with open("path/to/pdf-doc.pdf", "rb") as pdf_file:
    doc = minecart.Document(pdf_file)
    for page in doc.iter_pages():
        content.append(extract_output(page))
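
If the grid is evenly spaced, the nine boxes can also be computed from the outer edges of the grid instead of being typed in border by border. The following is a minimal sketch in plain Python, assuming the same layout as the example (grid spanning 2 to 8 inches horizontally and 1 to 7 inches vertically). `grid_boxes` and `POINTS_PER_INCH` are hypothetical names used for illustration and are not part of minecart.

```python
POINTS_PER_INCH = 72  # boxes are expressed in units of 1/72 inch


def grid_boxes(left, bottom, right, top, rows=3, cols=3):
    """Return (left, bottom, right, top) boxes in reading order.

    Arguments are in inches, measured from the bottom-left of the page.
    Hypothetical helper for illustration; assumes equally sized cells.
    """
    cell_w = (right - left) / cols
    cell_h = (top - bottom) / rows
    boxes = []
    for r in range(rows):  # r = 0 is the top row
        cell_top = top - r * cell_h
        for c in range(cols):  # c = 0 is the leftmost column
            cell_left = left + c * cell_w
            boxes.append((
                cell_left * POINTS_PER_INCH,
                (cell_top - cell_h) * POINTS_PER_INCH,
                (cell_left + cell_w) * POINTS_PER_INCH,
                cell_top * POINTS_PER_INCH,
            ))
    return boxes


boxes = grid_boxes(left=2, bottom=1, right=8, top=7)
# Label the cells A..I so each one can be looked up by name
cells = dict(zip("ABCDEFGHI", boxes))
```

Each resulting tuple has the same (left, bottom, right, top) shape as the entries of BOXES, so it should be usable in the same way. If the real cells are not equally sized, explicit border tuples are the safer choice.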