Pdfplumber cannot recognise table (Python)

Problem description

I use pdfplumber to extract the table on page 2, which is (normally) in section 3. It only works for some PDFs and fails for others. For the failing PDF files, it seems that pdfplumber reads the button table instead of the table I want.

How can I get the table? Link to the PDF that doesn't work: pdfA

Link to the PDF that works: pdfB

Here is my code:

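import pandas as pd
import pdfplumber

# Open the PDF and take the second page (pages are zero-indexed)
pdf = pdfplumber.open("/Users/chueckingmok/Desktop/selenium/Shell Omala 68.pdf")
page = pdf.pages[1]
# Extract a table using pdfplumber's default table settings
table = page.extract_table()

# Use the first extracted row as the column header
df = pd.DataFrame(table[1:], columns=table[0])
df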

The result is:

But the table I want on page 2 is:

However, this code works for pdfB (mentioned above).

By the way, the table I want in each PDF is in section 3.

Can anyone help?

Many thanks, Joan

Solution

Here is a solution to that problem, but first please read the points below.

  • You used pdfplumber for table extraction, but I think you should read about its table settings. There are many of them, and once you go through them with your needs in mind you will very likely find your answer there. See the table extraction section of the pdfplumber API documentation.
  • I give a solution for your problem below, but do check the pdfplumber documentation first. It covers table extraction as well as other things such as text extraction and word extraction, so you should rarely need to ask about these in future.
  • To understand the table settings better, you can also use visual debugging. It is one of pdfplumber's best features: it shows what the table settings do to a table and how the table is extracted with them. See the visual debugging of tables section of the documentation.
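The visual debugging mentioned in the last point can be sketched roughly as follows. This is a minimal sketch, assuming pdfplumber's `Page.to_image()` and `PageImage.debug_tablefinder()` as described in its README. The helper name `save_table_debug_image`, the resolution and the output path are hypothetical choices made for illustration, not part of the original answer.

```python
# Table settings taken from the solution in this answer
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
}


def save_table_debug_image(pdf_path, page_index, out_path, settings=None):
    """Save an image of one page with the detected table lines drawn on it.

    Hypothetical helper for illustration; it is not part of pdfplumber.
    """
    # Imported lazily so the settings above can be reused without pdfplumber
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        image = page.to_image(resolution=150)
        # Overlay the lines, intersections and cells found with these settings
        image.debug_tablefinder(settings or TABLE_SETTINGS)
        image.save(out_path)
    return out_path
```

Calling something like `save_table_debug_image("GSAP_msds_01259319.pdf", 1, "page2_debug.png")` and opening the resulting image should show which lines pdfplumber treats as cell borders, which makes it easier to tune the strategies and tolerances.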

Below is the solution to your problem:

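import pandas as pd
import pdfplumber

pdf = pdfplumber.open("GSAP_msds_01259319.pdf")
p1 = pdf.pages[1]  # second page (pages are zero-indexed)
# Take column borders from drawn lines and row borders from text alignment
table = p1.extract_table(table_settings={
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 4,
})
# Use the first extracted row as the column header
df = pd.DataFrame(table[1:], columns=table[0])
df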

See the output of the above code.
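A related point from the pdfplumber documentation: `extract_table()` returns only the largest table on the page, while `extract_tables()` returns every table it finds, and either can come back empty when nothing is detected. If a page holds more than one table, one option is to pick the wanted table by a word in its header row. The sketch below is only an illustration under that assumption; `pick_table` and `split_header` are hypothetical helper names, and the keyword to search for depends on the PDF.

```python
def pick_table(tables, keyword):
    """Return the first table whose header row mentions `keyword`, else None.

    `tables` is a list of tables as returned by `page.extract_tables()`:
    each table is a list of rows, each row a list of cell strings (or None).
    """
    needle = keyword.lower()
    for table in tables or []:
        if not table:
            continue  # skip empty tables
        header = [cell or "" for cell in table[0]]
        if any(needle in cell.lower() for cell in header):
            return table
    return None


def split_header(table):
    """Split a table into (header, rows), replacing None cells with ''."""
    if not table:
        raise ValueError("no table was found on the page")
    cleaned = [[cell or "" for cell in row] for row in table]
    return cleaned[0], cleaned[1:]
```

With pdfplumber this could be used as `header, rows = split_header(pick_table(p1.extract_tables(table_settings), keyword))`, followed by `pd.DataFrame(rows, columns=header)`, where `table_settings` is the dictionary from the solution and `keyword` is a word you expect in the header of the section 3 table.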
