How to extract tables as text from a PDF using Python?


Question

I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.

Right now I find the tables manually. I then take each page that contains a table and save it into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf"  # path to the source PDF

# PyPDF2 3.x names; older releases used PdfFileReader, getPage,
# PdfFileWriter and addPage, which were later removed.
pfr = PyPDF2.PdfReader(PDFfilename)

pg4 = pfr.pages[126]  # page 127 (pages are zero-indexed)

writer = PyPDF2.PdfWriter()
writer.add_page(pg4)  # add the selected page

NewPDFfilename = "allTables.pdf"  # path for the new PDF
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write the collected pages to the new PDF

My goal is to extract the tables from the whole PDF document.

Recommended answer

This answer is for anyone encountering PDFs that contain images and need OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work.

  1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the PDF into images.

  2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.

  3. Use OpenCV to find and extract tables.

  4. Use OpenCV to find and extract each cell from the table.

  5. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.

  6. Use Tesseract to OCR each cell.

  7. Combine the extracted text of each cell into the format you need.
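
As an illustration of the last step, here is a minimal sketch that joins per-cell OCR text into CSV. The function name rows_to_csv and the sample data are made up for this example; it assumes the cells have already been grouped into rows in reading order.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # OCR output often carries stray whitespace and newlines; collapse them.
        writer.writerow([" ".join(cell.split()) for cell in row])
    return buffer.getvalue()


# Hypothetical OCR results for a 2x2 table, in reading order.
ocr_rows = [["Name ", "Amount\n"], ["Widget, large", " 12.50"]]
csv_text = rows_to_csv(ocr_rows)
print(csv_text)
```

Using the csv module rather than joining with commas by hand means cells that themselves contain commas or quotes are escaped correctly.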

I wrote a Python package with modules that can help with those steps.

Repo: https://github.com/eihli/image-table-ocr

Docs and source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code because they rely on external tools like pdfimages and tesseract. I'll provide brief examples for a couple of the steps that do require code.
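
For the steps that rely on external tools, one possible way to drive them from Python is sketched below. The helper names are hypothetical and not part of the package above. The sketch assumes pdfimages, tesseract, and mogrify are on the PATH, that Tesseract prints its orientation report to standard output when the output base is stdout, and that the "Rotate:" line of that report gives the correction angle.

```python
import re
import subprocess


def pdfimages_command(pdf_path, image_prefix):
    # `pdfimages -png <pdf> <prefix>` extracts the images embedded in the PDF as PNG files.
    return ["pdfimages", "-png", pdf_path, image_prefix]


def osd_command(image_path):
    # `--psm 0` asks Tesseract for orientation and script detection only;
    # `stdout` as the output base prints the report instead of writing a file.
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_rotation(osd_report):
    """Return the degrees from the 'Rotate:' line of an OSD report, or 0 if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_report, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def mogrify_command(image_path, degrees):
    # `mogrify` overwrites the file in place.
    return ["mogrify", "-rotate", str(degrees), image_path]


def fix_rotation(image_path):
    """Detect the rotation of one page image and correct it in place."""
    report = subprocess.run(
        osd_command(image_path), capture_output=True, text=True, check=True
    ).stdout
    degrees = parse_rotation(report)
    if degrees:
        subprocess.run(mogrify_command(image_path, degrees), check=True)
    return degrees
```

Building each command as a list keeps file names with spaces safe, and the parsing step can be checked without the tools installed.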

  3. Finding tables:

This link was a good reference while figuring out how to find tables. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/

import cv2

def find_tables(image):
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, STD_DEV_Y_DIRECTION)
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    SCALE = 5
    # NumPy image shapes are (rows, columns), i.e. (height, width).
    # This assumes a single-channel (grayscale) image.
    image_height, image_width = img_bin.shape
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images

  4. Extracting cells from the table:

This is very similar to finding tables, so I won't include all the code. The part I will show is the sorting of the cells.

We want to identify the cells from left-to-right, top-to-bottom.

We'll find the rectangle with the most top-left corner. Then we'll find all of the rectangles that have a center that is within the top-y and bottom-y values of that top-left rectangle. Then we'll sort those rectangles by the x value of their center. We'll remove those rectangles from the list and repeat.

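def cell_in_same_row(c1, c2):
    # Cells are (x, y, w, h) bounding rectangles, as returned by cv2.boundingRect.
    # True when the vertical center of c1 lies between the top and bottom of c2.
    c1_center = c1[1] + c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom

# `cells` is the list of (x, y, w, h) rectangles found for one table.
# The loop below consumes it, so keep a copy of the original list.
orig_cells = [c for c in cells]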
rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = sorted(
        [
            c for c in rest
            if cell_in_same_row(c, first)
        ],
        key=lambda c: c[0]
    )

    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [
        c for c in rest
        if not cell_in_same_row(c, first)
    ]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h - h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
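
To try the sorting logic without running the image-processing steps, the fragment above can be wrapped in a function and fed hand-made rectangles. This is a sketch under that assumption; the function name sort_cells_into_rows and the sample coordinates are invented for the example.

```python
def cell_in_same_row(c1, c2):
    """True when the vertical center of c1 lies between the top and bottom of c2."""
    c1_center = c1[1] + c1[3] / 2
    return c2[1] < c1_center < c2[1] + c2[3]


def sort_cells_into_rows(cells):
    """Group (x, y, w, h) rectangles into rows, top-to-bottom and left-to-right."""
    remaining = list(cells)
    rows = []
    while remaining:
        first, rest = remaining[0], remaining[1:]
        same_row = [c for c in rest if cell_in_same_row(c, first)]
        rows.append(sorted([first] + same_row, key=lambda c: c[0]))
        remaining = [c for c in rest if not cell_in_same_row(c, first)]
    # Order the rows by the average vertical center of their cells.
    rows.sort(key=lambda row: sum(y + h / 2 for _, y, _, h in row) / len(row))
    return rows


# Made-up rectangles for a slightly skewed 2x2 grid, deliberately out of order.
cells = [(110, 52, 100, 40), (10, 10, 100, 40), (10, 50, 100, 40), (110, 12, 100, 40)]
rows = sort_cells_into_rows(cells)
for row in rows:
    print(row)
```

Because the rows are sorted at the end, the grouping does not depend on the order in which the contours were detected.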
