Extracting tables from a PDF


Question

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck, but I can't really get the data out of the tables.

This is what one of the tables looks like:

As you can see, some columns are marked with an 'x'. I'm trying to turn this table into a list of objects.

This is the code so far. I'm using pdfminer now.

# pdfminer test (written for Python 3 with the pdfminer.six package)
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()

    records = []
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            # process page
            interpreter.process_page(page)

            # take this page's text, then empty the buffer so the next page
            # does not see the text of the previous pages again
            lines = retstr.getvalue().splitlines()
            retstr.seek(0)
            retstr.truncate(0)

            # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
            idx = containsSubString(lines, 'Tool')
            if idx == -1:
                # no table header on this page
                continue
            lines = lines[idx + 1:]
            idx = containsSubString(lines, "1 The 'All'")
            if idx != -1:
                lines = lines[:idx]

            records.extend(lines)

    device.close()
    retstr.close()

    return records


def containsSubString(lines, substring):
    # return the index of the first item containing substring, or -1 if none does
    for i, s in enumerate(lines):
        if substring in s:
            return i
    return -1


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'

text = pdfToText(fn)
with open(ft, 'w') as outFile:
    for line in text:
        # splitlines() removed the line endings, so put them back
        outFile.write(line + '\n')

That produces a text file with all of the text, but the spacing between the x's is not preserved. The output looks like this:

The x's are just single-spaced in the text document.
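
One way the lost column information might be recovered, sketched here under assumptions that are not part of the original post: pdfminer's layout objects (for example those returned by `PDFPageAggregator`) carry a bounding box, so each 'x' could be matched to the column header whose horizontal centre is nearest. The function name `assign_marks_to_columns`, the column names, and the coordinates below are all hypothetical; the made-up numbers stand in for positions read from a real page.

```python
def assign_marks_to_columns(header_centres, mark_centres):
    """Map each mark's x coordinate to the nearest column header.

    header_centres: dict of column name -> horizontal centre of that header
    mark_centres:   list of horizontal centres of the 'x' marks found in one row
    Returns a dict of column name -> True if the row has a mark in that column.
    """
    row = {name: False for name in header_centres}
    for x in mark_centres:
        # pick the header whose centre is closest to this mark
        nearest = min(header_centres, key=lambda name: abs(header_centres[name] - x))
        row[nearest] = True
    return row


# Hypothetical header positions (in PDF points) and one row with two marks.
headers = {"A": 120.0, "B": 180.0, "C": 240.0}
row = assign_marks_to_columns(headers, [121.5, 238.0])
print(row)
```

Nearest-centre matching tolerates small horizontal offsets between a header and its marks, but it assumes columns are far enough apart that a mark is never closer to a neighbouring header.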

Right now I'm just producing text output, but my goal is to produce an HTML document with the data from the tables. I've been searching for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the results I'm looking for.
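
For the HTML step, a minimal sketch using only the standard library is shown below. It assumes the table has already been parsed into a list of dicts with one boolean per marked column; the function name `records_to_html` and the sample record are hypothetical.

```python
from html import escape


def records_to_html(headers, records):
    """Render a list of dicts as an HTML table; boolean True becomes 'x'."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    parts = ["<table>", f"  <tr>{head}</tr>"]
    for record in records:
        cells = []
        for h in headers:
            value = record.get(h, "")
            if isinstance(value, bool):
                value = "x" if value else ""
            # escape so that characters such as '<' in the data stay literal
            cells.append(f"<td>{escape(str(value))}</td>")
        parts.append("  <tr>" + "".join(cells) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


# Hypothetical record: a tool name plus two flag columns.
html_out = records_to_html(["Tool", "A", "B"], [{"Tool": "a<b", "A": True, "B": False}])
print(html_out)
```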

Edit: There will be multiple PDFs like this that I need to get the table data from. The headers will be the same for all of them (as far as I know).

Recommended answer

I figured it out: I was going in the wrong direction. What I did was create PNGs of each table in the PDF, and now I'm processing the images using OpenCV and Python.
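
The answer does not include code, so the following is only a sketch of how the image step might work, not the answerer's actual method. It assumes the table image has been cropped to an evenly spaced grid and decides whether a cell holds an 'x' by the fraction of dark pixels in it. Plain nested lists stand in for the grayscale array that OpenCV's `cv2.imread` would return; the function name `marked_cells` and both thresholds are hypothetical.

```python
def marked_cells(image, rows, cols, dark_below=128, min_dark_fraction=0.05):
    """Split a grayscale image into a rows x cols grid and flag cells with ink.

    image: list of pixel rows, each a list of 0-255 grayscale values;
           assumed to be at least rows pixels high and cols pixels wide
    Returns a rows x cols list of booleans (True = cell looks marked).
    """
    height, width = len(image), len(image[0])
    result = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        flags = []
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
            dark = sum(1 for p in pixels if p < dark_below)
            flags.append(dark / len(pixels) >= min_dark_fraction)
        result.append(flags)
    return result


# Tiny synthetic 4x4 "image": white everywhere except two dark pixels.
img = [[255] * 4 for _ in range(4)]
img[0][0] = 0
img[3][3] = 0
grid = marked_cells(img, rows=2, cols=2)
print(grid)
```

On a real scan the table's grid lines are dark too, so the cell interiors would need to be cropped (or the lines removed) before counting, and the thresholds tuned to the scan quality.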
