How to extract a table as text from a PDF using Python?

Question

I have a PDF that contains tables, text, and some images. I want to extract every table, wherever one appears in the PDF.

Right now I find the tables manually, page by page. I then capture each page that has a table and save it into another PDF.

import PyPDF2

# Uses the current PyPDF2 names (PdfReader, pages, PdfWriter, add_page).
# The older PdfFileReader/getPage/PdfFileWriter/addPage names were removed in PyPDF2 3.0.

PDFfilename = "Sammamish.pdf"  # path to the source PDF

pfr = PyPDF2.PdfReader(PDFfilename)  # PdfReader object

pg4 = pfr.pages[126]  # extract page 127 (pages are zero-indexed)

writer = PyPDF2.PdfWriter()  # create PdfWriter object
# add pages
writer.add_page(pg4)

NewPDFfilename = "allTables.pdf"  # path where the new PDF should be written
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write pages to new PDF

My goal is to extract the tables from the whole PDF document.

Recommended answer

This answer is for anyone who encounters PDFs with images and needs to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work.

  1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the PDF into images.
  2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
  3. Use OpenCV to find and extract tables.
  4. Use OpenCV to find and extract each cell from the table.
  5. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
  6. Use Tesseract to OCR each cell.
  7. Combine the extracted text of each cell into the format you need.
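As a rough illustration of the last step, the sketch below joins per-cell OCR text into CSV. It assumes the text has already been collected into a list of rows, each a list of strings. `rows_to_csv` and the sample data are hypothetical names used only for this example; they are not part of the package mentioned in this answer.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows of cell text (a list of lists of str) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # OCR output often carries stray whitespace and newlines; collapse them.
        writer.writerow([" ".join(cell.split()) for cell in row])
    return buffer.getvalue()


# Hypothetical OCR results for a 2x2 table.
sample_rows = [["Name ", "Amount\n"], ["Widget, large", " 12.50"]]
print(rows_to_csv(sample_rows))
```

The csv module quotes any cell that contains a comma, so cell text does not need to be escaped by hand.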

I wrote a Python package with modules that can help with those steps.

Repo: https://github.com/eihli/image-table-ocr

Docs and source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code; they take advantage of external tools like pdfimages and tesseract. I'll provide brief examples for a couple of the steps that do require code.
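For the steps that rely on external tools, a thin Python wrapper is one possible way to drive them. The sketch below is an assumption about how that could look, not the package's own code. It assumes `pdfimages`, `tesseract`, and `mogrify` are on the PATH, that Tesseract's orientation and script detection mode (`--psm 0`) prints a `Rotate: N` line when its output goes to stdout, and that this value can be passed straight to `mogrify -rotate`. All function names are hypothetical.

```python
import re
import subprocess


def pdfimages_command(pdf_path, output_prefix):
    """Build the pdfimages call that writes the PDF's images as PNG files."""
    return ["pdfimages", "-png", pdf_path, output_prefix]


def osd_command(image_path):
    """Build the Tesseract call for orientation and script detection."""
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_rotation(osd_output):
    """Return the value of the 'Rotate:' field in the OSD text, or 0 if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def mogrify_rotate_command(image_path, degrees):
    """Build the ImageMagick call that rotates the image file in place."""
    return ["mogrify", "-rotate", str(degrees), image_path]


def fix_rotation(image_path):
    """Detect the rotation of one page image and correct it in place."""
    result = subprocess.run(
        osd_command(image_path), capture_output=True, text=True, check=True
    )
    degrees = parse_rotation(result.stdout)
    if degrees:
        subprocess.run(mogrify_rotate_command(image_path, degrees), check=True)
    return degrees
```

Keeping the command construction separate from the `subprocess.run` calls makes the pieces easy to inspect or log before anything is executed.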

Finding tables:

This link was a good reference while figuring out how to find tables. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/

import cv2

def find_tables(image):
    # `image` is expected to be a single-channel 8-bit (grayscale) image,
    # as required by cv2.adaptiveThreshold.
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    # sigmaY is passed by keyword; the fourth positional argument is `dst`.
    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, sigmaY=STD_DEV_Y_DIRECTION)
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    # Invert the image so lines and text are white on black, then binarize.
    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    SCALE = 5
    # NumPy image shapes are (rows, columns), i.e. (height, width).
    image_height, image_width = img_bin.shape
    # Opening with a long, thin kernel keeps only long horizontal or vertical lines.
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    # Dilate the lines so that they join into one connected outline per table.
    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    # OpenCV 4.x returns (contours, hierarchy); OpenCV 3.x returns three values.
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    # Discard small contours, then reduce each remaining one to a bounding rectangle.
    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images

Extracting cells from the table:

This is very similar to finding tables, so I won't include all the code. The part I will show is the sorting of the cells.

We want to identify the cells from left-to-right, top-to-bottom.

We'll find the top-left-most rectangle. Then we'll find all of the rectangles whose center lies between the top-y and bottom-y values of that rectangle. Then we'll sort those rectangles by their x value. We'll remove those rectangles from the list and repeat.

# `cells` is assumed to be a list of (x, y, w, h) bounding rectangles,
# one per detected cell (for example from cv2.boundingRect).

def cell_in_same_row(c1, c2):
    # True if the vertical center of c1 lies between the top and bottom of c2.
    c1_center = c1[1] + c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom

rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = [
        c for c in rest
        if cell_in_same_row(c, first)
    ]

    # Order the cells of this row from left to right by their x coordinate.
    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    # Continue with the cells that were not assigned to this row.
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows from top to bottom by the average height of their cell centers.
def avg_height_of_center(row):
    centers = [y + h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
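To see the row grouping on concrete data, here is a self-contained variant of the same idea wrapped in a function. It is a sketch under the assumption that each cell is an (x, y, w, h) tuple; `group_cells_into_rows` and the sample rectangles are hypothetical and only serve this illustration.

```python
def group_cells_into_rows(cells):
    """Group (x, y, w, h) rectangles into rows, top-to-bottom and left-to-right."""

    def center_y(cell):
        return cell[1] + cell[3] / 2

    def in_same_row(cell, reference):
        # The cell's vertical center must fall inside the reference's vertical span.
        return reference[1] < center_y(cell) < reference[1] + reference[3]

    remaining = list(cells)
    rows = []
    while remaining:
        first, rest = remaining[0], remaining[1:]
        row = [first] + [c for c in rest if in_same_row(c, first)]
        rows.append(sorted(row, key=lambda c: c[0]))
        remaining = [c for c in rest if not in_same_row(c, first)]
    # Order the rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(center_y(c) for c in row) / len(row))
    return rows


# Hypothetical 2x2 grid of slightly misaligned cells, given in scrambled order.
cells = [(110, 52, 100, 40), (10, 10, 100, 40), (10, 50, 100, 40), (110, 8, 100, 40)]
for row in group_cells_into_rows(cells):
    print(row)
```

Because the rows are sorted at the end, the loop does not need to start from the top-left cell; any cell can seed a row.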
