使用Python从PDF中的物理坐标返回文本字符串 [英] Return text string from physical coordinates in a PDF with Python

查看:601
本文介绍了使用Python从PDF中的物理坐标返回文本字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在过去的几个小时中,我一直在与Google和PDFMiner的有限文档作战,尽管我感到很亲密,但我仍无法获得所需的东西.我已经完成了 http://www.unixuser.org/~euske/python/pdfminer /和所有三个YouTube视频,以更好地了解PDF,我可以输出原始文本.

I have been battling with Google and the limited documentation of PDFMiner for the last several hours, and although I feel close, I'm just not getting what I need. I've worked through http://www.unixuser.org/~euske/python/pdfminer/ and all three of the YouTube videos to gain a better understanding about PDFs and I'm able to output raw text just fine.

我正在研究一个脚本来解析多个PDF页面.不幸的是,对于这个项目,我处理的是质量较差的PDF文件,我看到的唯一可靠的常数是文本字符串的物理位置完全相同.尽管我读过一些暗示,可以通过物理坐标提取文本字符串,但是我还没有看到一个可行的示例.

I am working on a script to parse multiple PDF pages. Unfortunately, for this project I am dealing with poor quality PDF files, and the only reliable constant I see is the physical location of text strings being exactly the same. Although I've read hints that text strings can be extracted by physical coords, I have yet to see a working example.

有没有人可以阐明如何使用PDFMiner做到这一点?如果有明显更好的选择,我可以开放其他模块,但是我需要坚持使用Python作为脚本.

Is there anyone out there who could shed some light on how this is done with PDFMiner? I am open to other modules if there is an obvious better choice, however I need to stick with Python for the script.

此外,我也尝试过PyPdf也没有成功(除了基本文本输出).

Additionally, I have tried PyPdf to no success as well (other than basic text output).

谢谢!

推荐答案

由于Denis Papathanasiou编写了一些代码,我得以找到解决pdfminer的方法.在他的博客中讨论了该代码,您可以在此处找到源代码: layout_scanner.py

I was able to find my way around pdfminer thanks to some code by Denis Papathanasiou. The code is discussed in his blog, and you can find the source here: layout_scanner.py

尤其要看一下parse_lt_objs()方法.在最后一个循环中,k应该是一对,其中包含该文本位的坐标(并且将其丢弃).我没有要在这里发布的工作坐标提取器(我对它们不感兴趣),但是听起来您从那里找到自己的路很容易.

In particular, take a look at the method parse_lt_objs( ). In the final loop, k should be a pair containing the coordinates of that bit of text (and it is discarded). I don't have a working coordinate extractor to post here (I was not interested in them), but it sounds like you'll have no trouble finding your way from there.

祝你好运!

这篇关于使用Python从PDF中的物理坐标返回文本字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆