Python PDF Parsing with Camelot: Extracting the Table Title


Question

Camelot is a fantastic Python library for extracting tables from a PDF file as data frames. However, I'm looking for a solution that also returns the table description text written right above the table.

This is the code I'm using to extract tables from the PDF:

import camelot
# The lattice parser is selected with flavor='lattice' (Camelot's default flavor).
tables = camelot.read_pdf('test.pdf', pages='all', flavor='lattice', suppress_stdout=True)

I'd like to extract the text written above the table, i.e. THE PARTICULARS in my example PDF.

What would be the best approach for this? I'd appreciate any help. Thank you.

Recommended answer

You can create the Lattice parser directly instead of going through camelot.read_pdf:

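from camelot.parsers import Lattice

# Fragment adapted from Camelot's internal PDF handler. It assumes `pages` is a
# list of single-page PDF file paths, and that kwargs, suppress_stdout and
# layout_kwargs hold the same values you would otherwise pass to read_pdf.
tables = []
parser = Lattice(**kwargs)
for p in pages:
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout,
                              layout_kwargs=layout_kwargs)
    tables.extend(t)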

You then have access to parser.layout, which contains all the components on the page. Each of these components has a bbox (x0, y0, x1, y1), and the extracted tables also have a bbox object. Find the component closest to the top edge of the table, directly above it, and extract its text.
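The "closest component above the table" step could be sketched roughly as follows. This is only an illustration, not part of Camelot: text_above is a hypothetical helper, and it assumes bounding boxes use PDF coordinates (origin at the bottom-left, so a larger y means higher on the page) and that text components expose a bbox attribute and a get_text() method, as pdfminer text boxes do.

```python
from typing import Iterable, Optional, Tuple

# (x0, y0, x1, y1) with the origin at the bottom-left corner of the page.
BBox = Tuple[float, float, float, float]


def text_above(table_bbox: BBox, components: Iterable,
               tolerance: float = 2.0) -> Optional[str]:
    """Return the text of the nearest text component above table_bbox.

    Hypothetical helper: `components` is any iterable of layout objects;
    objects without a get_text() method (lines, figures, ...) are skipped.
    """
    tx0, _, tx1, table_top = table_bbox
    best_gap, best_text = None, None
    for comp in components:
        if not hasattr(comp, "get_text"):
            continue  # not a text component
        x0, y0, x1, _ = comp.bbox
        if x1 < tx0 or x0 > tx1:
            continue  # no horizontal overlap with the table
        gap = y0 - table_top  # distance from table top to component bottom
        if gap < -tolerance:
            continue  # component starts below the table's top edge
        text = comp.get_text().strip()
        if text and (best_gap is None or gap < best_gap):
            best_gap, best_text = gap, text
    return best_text
```

With Camelot this might be called inside the page loop, for example as text_above(table._bbox, parser.layout) for each table returned by extract_tables. Both the private _bbox attribute and the fact that parser.layout describes only the most recently parsed page are assumptions about Camelot's internals, so check them against the version you have installed.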
