Working with tables in a PDF using Python


Question

I am working on a PDF file that contains a number of tables.
Given the name of a table as it appears in the PDF, I want to fetch the data from that table using Python.

I have worked on HTML and XML parsing, but never with PDF.
Can anyone tell me how to fetch tables from a PDF using Python?

Recommended answer

I had a similar problem recently, and wrote a library to help solve it: pdfquery.

PDFQuery creates an element tree from the PDF (using pdfminer, with some extra sugar) and lets you fetch elements from the page using jQuery or XPath selectors, based mostly on the text contents or locations of the elements. So to parse a table, you would first find where it is in the document by searching for its label:

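import pdfquery
pdf = pdfquery.PDFQuery("your_file.pdf")  # placeholder path; substitute your own PDF
pdf.load()
label = pdf.pq(':contains("Name of your table")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))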

Then you would keep searching for rows beneath the label until the search returns no results:

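page = label.closest('LTPage')
while True:
    # PDF y-coordinates increase upward, so rows below the label have smaller y values.
    # Each bbox is "x0,y0,x1,y1", with (x0, y0) the lower-left corner.
    row = pdf.extract([
        ('column_1', ':in_bbox("%s,%s,%s,%s")' % (left_corner+10, bottom_corner-40, left_corner+50, bottom_corner-20)),
        ('column_2', ':in_bbox("%s,%s,%s,%s")' % (left_corner+50, bottom_corner-40, left_corner+80, bottom_corner-20)),
    ], page)
    # Stop once neither column matched anything.
    if not (row['column_1'] or row['column_2']):
        break
    print("Got row:", row)
    bottom_corner -= 20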

This assumes that your rows are 20 pt high, that the first row starts 20 pt below the label, that the first column spans 10 to 50 pt from the left edge of the label, and that the second column spans 50 to 80 pt from the left edge of the label.
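To make those layout assumptions explicit, here is a minimal sketch of the bounding-box arithmetic in plain Python. The helpers `cell_bbox` and `in_bbox_selector` are hypothetical names introduced for illustration and are not part of pdfquery. The sketch assumes PDF coordinates with the origin at the bottom-left of the page, so a row further down the page has smaller y values.

```python
def cell_bbox(left_corner, bottom_corner, row_index, col_start, col_end,
              row_height=20.0, first_row_offset=20.0):
    """Return (x0, y0, x1, y1) in points for one table cell.

    left_corner / bottom_corner are the label's x0 / y0.
    col_start / col_end are offsets from the label's left edge.
    Hypothetical helper: the defaults mirror the layout described above.
    """
    # Top edge of this row: below the label, so subtract from y.
    y1 = bottom_corner - first_row_offset - row_index * row_height
    y0 = y1 - row_height
    return (left_corner + col_start, y0, left_corner + col_end, y1)


def in_bbox_selector(bbox):
    """Format a bbox tuple as an :in_bbox("x0,y0,x1,y1") selector string."""
    return ':in_bbox("%s,%s,%s,%s")' % bbox


# Example: label at x0=100, y0=700; first row, first column (10 to 50 pt).
first_cell = cell_bbox(100.0, 700.0, row_index=0, col_start=10, col_end=50)
first_selector = in_bbox_selector(first_cell)
```

Keeping the row height and column offsets as parameters means a different table layout only needs different arguments, not a rewritten loop.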

Blank rows or rows of varying height make this more tedious. If the entries in the table sit close enough together that the parser treats them as a single line, you may also need to use the merge_tags=None option to select individual characters rather than words. But hopefully this gets you closer.
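For rows of varying height, one possible alternative to stepping down in fixed 20 pt increments is to collect the text fragments first and then group them by their y position. The sketch below is plain Python and does not use pdfquery. The function `group_into_rows`, the `(x0, y0, text)` tuple format, and the 3 pt tolerance are all assumptions made for illustration; in practice the tuples would have to be built from the `x0` and `y0` attributes of the matched elements.

```python
def group_into_rows(fragments, tolerance=3.0):
    """Group (x0, y0, text) fragments into rows of text, top to bottom.

    Fragments whose y0 lies within `tolerance` points of the first fragment
    in the current row are treated as belonging to that row.
    Hypothetical helper, not part of pdfquery.
    """
    rows = []
    current = []
    current_y = None
    # Larger y is higher on the page, so sort by descending y.
    for x0, y0, text in sorted(fragments, key=lambda f: -f[1]):
        if current and abs(y0 - current_y) > tolerance:
            rows.append(current)
            current = []
        if not current:
            current_y = y0
        current.append((x0, text))
    if current:
        rows.append(current)
    # Within each row, order the cells left to right.
    return [[text for _, text in sorted(row, key=lambda c: c[0])]
            for row in rows]


# Made-up sample fragments: two rows, 35 pt apart, slightly misaligned.
sample = [
    (110.0, 660.0, "Alice"),
    (150.0, 661.0, "30"),
    (150.0, 625.0, "41"),
    (110.0, 624.5, "Bob"),
]
table = group_into_rows(sample)
```

The tolerance absorbs small baseline differences between cells in the same row; it would need tuning for a real document.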
