Parsing PDF files (especially with tables) with PDFBox


Question

I need to parse a PDF file that contains tabular data. I'm using PDFBox to extract the file's text so that I can parse the resulting String later. The problem is that text extraction doesn't work as I expected for tabular data. For example, I have a file containing a table like this (7 columns: the first two always have data, exactly one of the Complexity columns has data, and exactly one of the Financing columns has data):

+----------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing       |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+----------------------------------------------------------------+
| xyz | 12.43 | 12.43  |      |                | 12.43     |     |
+----------------------------------------------------------------+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+----------------------------------------------------------------+

Then I use PDFBox:

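// pathToFile is the path of the PDF to read
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
document.close(); // release the underlying file resources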

Those two rows of data are extracted like this:

xyz 12.43 12.4312.43
abc 1.56 1.561.56

There is no whitespace between the last two numbers, but that is not the biggest problem. The real problem is that I don't know what the last two numbers mean: Medium, High, or Not applicable? MAC/Other or FAE? The relation between the numbers and their columns is lost.

I am not required to use PDFBox, so a solution based on another library is fine. What I want is to be able to parse the file and know what each parsed number means.

Recommended answer

You will need to devise an algorithm to extract the data in a usable format, and you will need to do this regardless of which PDF library you use. A PDF typically does not store a table as rows and columns. Characters and graphics are drawn by a series of stateful drawing operations, for example "move to this position on the page and draw the glyph for the character 'c'".

I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions of your table. Then it's a matter of setting up text regions and determining which numbers, letters and other characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column each piece of extracted text belongs to.
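The region lookup described above can be sketched in plain Java, without any PDFBox calls. This is only an illustration under assumptions: the line segments and the text positions are taken as already collected (for example by the strokePath override), and the names TableColumns, Fragment, columnBoundaries, columnOf and toRow, as well as all coordinates, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TableColumns {

    /** A piece of text and the x coordinate at which it starts (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        Fragment(String text, double x) { this.text = text; this.x = x; }
    }

    /**
     * Collects the x positions of the vertical segments, sorted, with positions
     * closer than tolerance merged. Each segment is {x1, y1, x2, y2}.
     */
    static double[] columnBoundaries(List<double[]> segments, double tolerance) {
        List<Double> xs = new ArrayList<>();
        for (double[] s : segments) {
            if (Math.abs(s[0] - s[2]) <= tolerance) { // vertical: x barely changes
                xs.add(s[0]);
            }
        }
        Collections.sort(xs);
        List<Double> merged = new ArrayList<>();
        for (double x : xs) {
            if (merged.isEmpty() || x - merged.get(merged.size() - 1) > tolerance) {
                merged.add(x);
            }
        }
        double[] result = new double[merged.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = merged.get(i);
        }
        return result;
    }

    /** Index of the column [boundaries[i], boundaries[i + 1]) containing x, or -1. */
    static int columnOf(double x, double[] boundaries) {
        for (int i = 0; i + 1 < boundaries.length; i++) {
            if (x >= boundaries[i] && x < boundaries[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    /** Puts every fragment of one table row into the cell of the column it falls in. */
    static String[] toRow(List<Fragment> fragments, double[] boundaries) {
        String[] row = new String[Math.max(0, boundaries.length - 1)];
        Arrays.fill(row, "");
        for (Fragment f : fragments) {
            int col = columnOf(f.x, boundaries);
            if (col >= 0) {
                row[col] += f.text;
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up geometry: eight vertical lines bound the seven columns.
        List<double[]> segments = new ArrayList<>();
        for (double x : new double[] {0, 40, 90, 150, 200, 300, 380, 420}) {
            segments.add(new double[] {x, 0, x, 60});
        }
        segments.add(new double[] {0, 0, 420, 0}); // a horizontal line, ignored here
        double[] boundaries = columnBoundaries(segments, 0.5);

        List<Fragment> row = new ArrayList<>();
        row.add(new Fragment("xyz", 5));
        row.add(new Fragment("12.43", 45));
        row.add(new Fragment("12.43", 95));  // falls in the "Medium" column
        row.add(new Fragment("12.43", 305)); // falls in the "MAC/Other" column
        System.out.println(Arrays.toString(toRow(row, boundaries)));
    }
}
```

Row boundaries could be derived the same way from the horizontal segments. Real documents may also draw cell borders as filled rectangles rather than stroked lines, in which case the segment collection step would need to be adapted.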

Also, text that is visually separated may come out without spaces because, very often, the PDF never draws a space character. Instead, the text matrix is updated and a "move" operation positions the next character roughly a space width away from the last one.
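One way to compensate is to rebuild the spaces from the glyph positions: whenever the horizontal gap between two consecutive glyphs exceeds some threshold, treat it as a word break. The following is a minimal sketch in plain Java, assuming the glyph positions and widths have already been obtained from the PDF library; the names GapSpacer, Glyph, addWord and join, the coordinates and the threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class GapSpacer {

    /** One drawn glyph: its character, the x where it starts, and its width. */
    static final class Glyph {
        final char c;
        final double x;
        final double width;
        Glyph(char c, double x, double width) { this.c = c; this.x = x; this.width = width; }
    }

    /** Appends the glyphs of a word, laid out left to right with no gaps in between. */
    static void addWord(List<Glyph> glyphs, String word, double startX, double glyphWidth) {
        for (int i = 0; i < word.length(); i++) {
            glyphs.add(new Glyph(word.charAt(i), startX + i * glyphWidth, glyphWidth));
        }
    }

    /**
     * Joins the glyphs of one text line (sorted left to right), inserting a space
     * whenever the gap between the end of one glyph and the start of the next
     * is larger than minGap.
     */
    static String join(List<Glyph> glyphs, double minGap) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null && g.x - (prev.x + prev.width) > minGap) {
                sb.append(' ');
            }
            sb.append(g.c);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two numbers drawn far apart with no space character between them.
        List<Glyph> line = new ArrayList<>();
        addWord(line, "1.56", 100, 5);
        addWord(line, "1.56", 200, 5);
        System.out.println(join(line, 2.5));
    }
}
```

The threshold is a judgment call. A fraction of the font's space width is a common starting point, but it may need tuning per document.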

Good luck.
