pypdf not extracting tables from PDF

Views: 129 | Published: 2020/7/4 21:27:45 | Tags: python, pypdf

Problem description

I am using pypdf to extract text from PDF files. The problem is that the tables in the PDF files are not extracted. I have also tried pdfminer, but I am having the same issue.

Recommended answer

The problem is that tables in PDFs are generally made up of absolutely positioned lines and characters, with no explicit row or cell structure, so converting them into a sensible table representation is non-trivial.
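To make the point concrete, here is a minimal sketch of what "rebuilding" a table from positioned text involves. It assumes the words have already been extracted as hypothetical `(x, y, text)` tuples in PDF-style coordinates (origin at the bottom left), and the tolerance `y_tol` is an arbitrary value chosen for illustration. Real documents need more care (wrapped cells, merged cells, skewed scans).

```python
from typing import List, Tuple

# Hypothetical input format: (x, y, text), with y growing upward as in PDF space.
Word = Tuple[float, float, str]


def words_to_rows(words: List[Word], y_tol: float = 3.0) -> List[List[str]]:
    """Group positioned words into rows (top to bottom), each sorted left to right."""
    rows: List[List[Word]] = []
    # Visit words from the top of the page downward.
    for word in sorted(words, key=lambda w: -w[1]):
        # Treat the word as part of the current row if its y position is
        # within y_tol of the first word placed in that row.
        if rows and abs(rows[-1][0][1] - word[1]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Inside each row, order the words by x position and keep only the text.
    return [[w[2] for w in sorted(row, key=lambda w: w[0])] for row in rows]


sample = [
    (50.0, 700.0, "Name"), (200.0, 700.5, "Qty"),
    (50.0, 680.0, "Apple"), (200.0, 679.0, "3"),
]
print(words_to_rows(sample))
```

Even in this toy case the result depends on a hand-picked tolerance, which is why general-purpose text extractors do not return tables on their own.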

In Python, PDFMiner is probably your best bet. It gives you a tree of layout objects, but you will have to interpret the table yourself by looking at the positions of the lines (LTLine) and text boxes (LTTextBox). A little documentation on this is available.
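As a rough sketch of that approach, the code below assumes the maintained `pdfminer.six` package is installed and uses its `extract_pages` helper. The function names `collect_layout`, `column_edges` and `column_of`, and the tolerance value, are illustrative choices of mine, not part of PDFMiner. It only looks at top-level layout objects and only uses vertical ruling lines to decide columns, so treat it as a starting point rather than a complete table parser.

```python
from bisect import bisect_right
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def collect_layout(path: str, page_index: int = 0):
    """Return (text_boxes, lines) for one page: [(bbox, text), ...] and [bbox, ...]."""
    # Imported lazily so the pure helpers below work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTLine, LTTextBox

    text_boxes, lines = [], []
    for i, page in enumerate(extract_pages(path)):
        if i != page_index:
            continue
        for element in page:  # top-level layout objects only
            if isinstance(element, LTTextBox):
                text_boxes.append((element.bbox, element.get_text().strip()))
            elif isinstance(element, LTLine):
                lines.append(element.bbox)
        break
    return text_boxes, lines


def column_edges(lines: List[BBox], tol: float = 1.0) -> List[float]:
    """x positions of (near-)vertical ruling lines, sorted and de-duplicated."""
    xs = sorted(x0 for x0, _, x1, _ in lines if abs(x1 - x0) <= tol)
    edges: List[float] = []
    for x in xs:
        if not edges or x - edges[-1] > tol:
            edges.append(x)
    return edges


def column_of(bbox: BBox, edges: List[float]) -> int:
    """Column index of the box's horizontal centre (-1 if left of the first edge)."""
    centre = (bbox[0] + bbox[2]) / 2
    return bisect_right(edges, centre) - 1
```

Typical use would be `boxes, lines = collect_layout('example.pdf')`, then `edges = column_edges(lines)` and `column_of(bbox, edges)` for each text box. Rows can be derived the same way from the horizontal lines.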

Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you could do something like the following:

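import urllib.request  # Python 3 replacement for urllib2
import xml.etree.ElementTree as ET

# Send the PDF to the PDFX web service
with open('example.pdf', 'rb') as f:
    pdfdata = f.read()
request = urllib.request.Request('http://pdfx.cs.man.ac.uk', data=pdfdata, headers={'Content-Type': 'application/pdf'})
response = urllib.request.urlopen(request).read()

# Parse the XML response and serialize the parts of each detected table
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    # find() returns None when a sub-element is missing, and ET.tostring()
    # raises on None, so add a check if some tables may lack these parts
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))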
