PDF Parsing with Text and Coordinates
Question
I am currently using PDFBox to parse a PDF, and I am trying to figure out how to retrieve data about the text, such as the font (bold, size, etc.) and the position of the text on the page.
Any suggestions?
Recommended answer
After poking around the (hard to find) PDFBox docs, I found this little gem.
Apparently one of the examples shows exactly how to do everything you asked. Basically, you subclass PDFTextStripper and override the processTextPosition method. There, you query the TextPosition for whatever information you need.
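In case the example link goes away again, here is a minimal sketch of that approach. It is not the official example, just an illustration under a few assumptions: it targets PDFBox 2.x or later, where PDFTextStripper and TextPosition live in org.apache.pdfbox.text (in 1.8 they were in org.apache.pdfbox.util). The class name GlyphInfoStripper and the describe helper are made up for this post, and the "bold" check is only a heuristic based on the font name and the font descriptor, since PDF has no reliable bold flag.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDFontDescriptor;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical helper class: collects one description per glyph that PDFBox processes.
public class GlyphInfoStripper extends PDFTextStripper {

    private final List<String> glyphs = new ArrayList<>();

    public GlyphInfoStripper() throws IOException {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        PDFont font = text.getFont();
        String fontName = font.getName();
        PDFontDescriptor descriptor = font.getFontDescriptor();

        // Heuristic (an assumption, not a PDFBox feature): treat the glyph as bold
        // if the font name mentions "bold" or the descriptor sets the ForceBold flag.
        boolean bold = (fontName != null && fontName.toLowerCase(Locale.ROOT).contains("bold"))
                || (descriptor != null && descriptor.isForceBold());

        glyphs.add(String.format(Locale.ROOT,
                "'%s' x=%.1f y=%.1f size=%.1fpt font=%s bold=%b",
                text.getUnicode(),
                text.getXDirAdj(),      // x position, adjusted for text direction
                text.getYDirAdj(),      // y position, measured from the top of the page
                text.getFontSizeInPt(),
                fontName,
                bold));

        // Keep the stripper's normal text extraction working.
        super.processTextPosition(text);
    }

    // Runs the stripper over an already opened document and returns the collected lines.
    public static List<String> describe(PDDocument document) throws IOException {
        GlyphInfoStripper stripper = new GlyphInfoStripper();
        stripper.getText(document); // drives processTextPosition for every glyph
        return stripper.glyphs;
    }
}
```

To use it, open the document first and pass it in. How you open it depends on the version: PDDocument.load(file) in 2.x, Loader.loadPDF(file) in 3.x. Then print each string returned by GlyphInfoStripper.describe(document).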
For future reference, you can find the Javadoc here: http://pdfbox.apache.org/apidocs/index.html
Edit 2018-04-02: original link is dead, but example can be found in the SVN repo here.