How to extract charts/tables/graphs from PDF files using Python?

Question

I searched quite a bit but couldn't find a solution to this kind of problem, so I'm posting a specific question about it. Most answers cover image or text extraction, which is comparatively easy.

I need to extract tables as text (CSV) and graphs as images from PDFs.

Can anyone help me with efficient Python 3.6 code for this?

So far I can extract JPEGs by scanning for startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly at this.
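
For reference, the marker scan described above can be sketched roughly as follows. The asker's actual code is not shown, so the function names `carve_jpegs` and `dump_jpegs` and the output file naming are assumptions for illustration only. The sketch also shows why the approach is limited: it only finds byte runs that happen to be stored as raw JPEG data, so Flate-compressed images, vector-drawn charts, and tables made of text are never matched, and a JPEG with an embedded thumbnail can be cut short at the thumbnail's end marker.

```python
from pathlib import Path
from typing import List

SOI = b"\xff\xd8"  # JPEG start-of-image marker (the question's startmark)
EOI = b"\xff\xd9"  # JPEG end-of-image marker (the question's endmark)


def carve_jpegs(data: bytes) -> List[bytes]:
    """Return every byte run that starts at SOI and ends at the next EOI."""
    found = []
    pos = 0
    while True:
        start = data.find(SOI, pos)
        if start == -1:
            break  # no further start marker
        end = data.find(EOI, start + len(SOI))
        if end == -1:
            break  # start marker without a matching end marker
        found.append(data[start:end + len(EOI)])
        pos = end + len(EOI)
    return found


def dump_jpegs(pdf_path: str, out_dir: str) -> int:
    """Write each carved candidate to out_dir; return how many were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks = carve_jpegs(Path(pdf_path).read_bytes())
    for i, chunk in enumerate(chunks):
        # Hypothetical naming scheme; candidates are not validated as images.
        (out / "img_{:03d}.jpg".format(i)).write_bytes(chunk)
    return len(chunks)
```

Calling `dump_jpegs("annual-report-2017.pdf", "out")` would write whatever candidates the scan finds; each file still needs checking, since a matching byte pair does not guarantee a valid image.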

For example, I want to extract the table on page 11 and the graphs on page 12 of the PDF linked below, as images or in whatever form is feasible. How should I go about it?

https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf

Recommended answer

To extract tables, you can use Camelot.

Here is an article about it.
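
The answer gives no code, so here is a minimal sketch of what table extraction might look like, assuming the library meant is Camelot (the `camelot-py` package). The function name `tables_to_csv`, the `read_pdf` injection parameter, and the output file names are my own choices for illustration; only `camelot.read_pdf(..., pages=..., flavor=...)` and the `.df` attribute of each table come from Camelot itself.

```python
from pathlib import Path
from typing import Callable, List, Optional


def tables_to_csv(pdf_path: str, pages: str, out_dir: str,
                  flavor: str = "lattice",
                  read_pdf: Optional[Callable] = None) -> List[Path]:
    """Save each table detected on the given pages as its own CSV file."""
    if read_pdf is None:
        import camelot  # third-party, assumed installed: pip install camelot-py
        read_pdf = camelot.read_pdf
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # "lattice" relies on ruled cell borders; "stream" infers columns
    # from whitespace and may suit tables without drawn lines.
    tables = read_pdf(pdf_path, pages=pages, flavor=flavor)
    written = []
    for i, table in enumerate(tables):
        target = out / "table_{}.csv".format(i)  # hypothetical naming scheme
        # table.df is a pandas DataFrame holding the extracted cells
        table.df.to_csv(str(target), index=False, header=False)
        written.append(target)
    return written
```

For the report in the question, a call such as `tables_to_csv("annual-report-2017.pdf", "11", "out")` would be the starting point; if no tables are found, retrying with `flavor="stream"` is worth a try. Camelot works on text-based PDFs, so a scanned page would need OCR first.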

For images, I found this question and its answers: "Extract images from PDF without resampling, in python?"
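
The linked thread is not reproduced here, so the following is only a sketch of one way to pull embedded images out of a page, using pypdf as my own substitution rather than anything the answer prescribes. The helper name `save_embedded_images` and the file naming are hypothetical; the `images` property on a page, with `name` and `data` on each entry, is pypdf's documented interface (it may need Pillow installed). Charts drawn as vector graphics are not embedded images and will not show up this way; those would have to be rendered from the page instead.

```python
from pathlib import Path
from typing import List


def save_embedded_images(page, out_dir: str, prefix: str) -> List[Path]:
    """Write each embedded image of one page to disk from its stored bytes.

    `page` is expected to behave like a pypdf page object: it exposes
    `images`, whose items carry `name` (with extension) and `data` (bytes).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, image in enumerate(page.images):
        # Hypothetical naming scheme: prefix, running index, pypdf's name
        target = out / "{}_{}_{}".format(prefix, i, image.name)
        target.write_bytes(image.data)
        written.append(target)
    return written
```

With pypdf installed, usage would look like `reader = PdfReader("annual-report-2017.pdf")` followed by `save_embedded_images(reader.pages[11], "out", "page12")`, where `PdfReader` is imported from `pypdf` and index 11 is page 12 because pages are zero-based.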
