Parsing PDF files (especially with tables) with PDFBox

Question

I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text to parse the result (String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):

+----------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing       |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+----------------------------------------------------------------+
| xyz | 12.43 | 12.34  |      |                | 12.34     |     |
+----------------------------------------------------------------+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+----------------------------------------------------------------+

Then I use PDFBox:

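// pathToFile is the path of the PDF to read
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
document.close(); // release the file once the text has been extracted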

Those two lines of data would be extracted like this:

xyz 12.43 12.4312.43
abc 1.56 1.561.56

There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.

It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

Recommended answer

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
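
To make the idea concrete, here is a minimal sketch of the bookkeeping in plain Java, with no PDFBox calls. It assumes you have already collected the stroked line segments and the start position of each text fragment from the drawing operations; the Segment and TextChunk types, the method names, the tolerance value and the coordinates in main are all hypothetical and only serve as illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Sketch: map text fragments to table columns using the x-positions of vertical lines. */
final class ColumnAssigner {

    /** A stroked line segment in page coordinates (hypothetical helper type). */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** A piece of text and the x-coordinate where it starts (hypothetical helper type). */
    static final class TextChunk {
        final String text;
        final double x;
        TextChunk(String text, double x) { this.text = text; this.x = x; }
    }

    /** Returns the sorted x-positions of the vertical segments, merging near-duplicates. */
    static List<Double> columnBoundaries(List<Segment> segments, double tolerance) {
        TreeSet<Double> xs = new TreeSet<>();
        for (Segment s : segments) {
            if (Math.abs(s.x1 - s.x2) > tolerance) {
                continue; // not vertical, so it does not separate columns
            }
            // Skip the segment if a known boundary already lies within the tolerance.
            Double near = xs.floor(s.x1 + tolerance);
            if (near == null || s.x1 - near > tolerance) {
                xs.add(s.x1);
            }
        }
        return new ArrayList<>(xs);
    }

    /** Returns the index of the column that contains x, or -1 if x is outside the table. */
    static int columnIndex(List<Double> boundaries, double x) {
        for (int i = 0; i + 1 < boundaries.size(); i++) {
            if (x >= boundaries.get(i) && x < boundaries.get(i + 1)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] headers = {"AIH", "Value", "Medium", "High", "Not applicable", "MAC/Other", "FAE"};
        // Eight vertical lines delimit seven columns (made-up coordinates).
        List<Segment> lines = new ArrayList<>();
        for (double x : new double[] {0, 40, 90, 150, 200, 300, 380, 420}) {
            lines.add(new Segment(x, 0, x, 100));
        }
        List<Double> boundaries = columnBoundaries(lines, 0.5);
        List<TextChunk> row = Arrays.asList(
                new TextChunk("xyz", 5), new TextChunk("12.43", 45),
                new TextChunk("12.34", 95), new TextChunk("12.34", 305));
        for (TextChunk c : row) {
            System.out.println(headers[columnIndex(boundaries, c.x)] + " = " + c.text);
        }
    }
}
```

Horizontal segments can be handled the same way to obtain row boundaries. Matching on the start x of each fragment is a simplification; a fragment that straddles a boundary would need its full bounding box.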

Also, the reason you may not have spaces between text that is visually separated is that very often the PDF does not draw a space character at all. Instead, the text matrix is updated and a 'move' command is issued so that the next character is drawn a "space width" apart from the last one.
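
One way to compensate is to rebuild the spaces from the horizontal gaps between text runs. The following is a rough sketch in plain Java, not PDFBox code: the Run type, the join method and the "half a space width" threshold are assumptions chosen for illustration, and it presumes you already know each run's start x and width on a single line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: re-insert spaces between text runs based on the horizontal gap between them. */
final class GapJoiner {

    /** A run of text with its start x and width in page units (hypothetical helper type). */
    static final class Run {
        final String text;
        final double x;
        final double width;
        Run(String text, double x, double width) {
            this.text = text; this.x = x; this.width = width;
        }
    }

    /**
     * Joins the runs of one line from left to right. A space is inserted whenever the gap
     * between two runs is at least half of spaceWidth (the factor 0.5 is an assumption).
     */
    static String join(List<Run> runs, double spaceWidth) {
        List<Run> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingDouble(r -> r.x));
        StringBuilder sb = new StringBuilder();
        Run prev = null;
        for (Run r : sorted) {
            if (prev != null) {
                double gap = r.x - (prev.x + prev.width);
                if (gap >= spaceWidth * 0.5) {
                    sb.append(' ');
                }
            }
            sb.append(r.text);
            prev = r;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up positions for the first table row of the question.
        List<Run> row = Arrays.asList(
                new Run("xyz", 5, 15), new Run("12.43", 45, 25),
                new Run("12.34", 95, 25), new Run("12.34", 305, 25));
        System.out.println(join(row, 3.0));
    }
}
```

The threshold usually needs tuning per document, because fonts and character spacing differ. Combined with column boundaries, the gap test matters less, since fragments in different columns are separated by region anyway.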

Good luck.
