iText: Extracted text from pdf file using LocationTextExtractionStrategy is in wrong order


Problem description

I am using iText to extract text from a specific region of a page in a pdf file. To do that, I am using the LocationTextExtractionStrategy:

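import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.*;

public class LocationTextExtractionTest {
    public static void main(String[] args) throws Exception {
        PdfReader pdfReader = new PdfReader("location_text_extraction_test.pdf");

        // Region to extract from, given as (llx, lly, urx, ury) in PDF user space units
        Rectangle rectangle = new Rectangle(38, 0, 516, 516);

        // Pass only text inside the rectangle on to the location-based strategy
        RenderFilter[] filter = {new RegionTextRenderFilter(rectangle)};
        TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        String text = PdfTextExtractor.getTextFromPage(pdfReader, 1, strategy);

        System.out.println(text);

        pdfReader.close();
    }
}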

Link to pdf file

The problem is that the extracted text is in the wrong order:

What should be extracted as:

Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $

is extracted as:

Total For Line Extended Price
Part Description Quantity Unit Price
1.00 407.84 $ USD 407.84 407.84 $
Landing Fee

Note that when I open the pdf in Acrobat, select all the text with Ctrl+A, copy it, and paste it into a text editor, everything is in the correct order.

Is there a way to resolve the problem? Thanks a lot ;)

Solution

The cause is simply that "Total For Line Extended Price" is at a y coordinate of 507.37 while "Part Description Quantity Unit Price" is at a y coordinate of 506.42.

The LocationTextExtractionStrategy allows for small variations by only considering the integer part of the y coordinates but even the integer parts differ here. Thus, it assumes the former headings to be on a line above the latter ones and outputs its results accordingly.
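The effect of comparing only the integer parts can be illustrated with a small stand-alone sketch. It does not use iText: the Chunk class and the toLines method are hypothetical stand-ins for illustration, and the x coordinates are assumed; only the two y coordinates are taken from this document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IntegerLineGrouping {
    /** Simplified stand-in for a positioned text chunk; not an iText class. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Groups chunks into lines by the integer part of y, top of page first. */
    static List<String> toLines(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> (int) c.y).reversed()
                .thenComparingDouble(c -> c.x));
        List<String> lines = new ArrayList<>();
        int currentKey = 0;
        for (Chunk c : sorted) {
            int key = (int) c.y;
            if (lines.isEmpty() || key != currentKey) {
                lines.add(c.text);            // integer part changed: start a new line
                currentKey = key;
            } else {
                int last = lines.size() - 1;  // same integer part: append to current line
                lines.set(last, lines.get(last) + " " + c.text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        // y values as reported in this answer; x values are assumed for illustration
        chunks.add(new Chunk("Part Description Quantity Unit Price", 40f, 506.42f));
        chunks.add(new Chunk("Total For Line Extended Price", 380f, 507.37f));
        for (String line : toLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

Because 507.37 truncates to 507 and 506.42 truncates to 506, the two headings land on separate lines, with the right-hand heading first. Had both y values shared the same integer part, they would have been joined into one line ordered by x.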

With such variations, a usual first attempt is to try the SimpleTextExtractionStrategy, which returns the text in the order in which it is drawn in the page content. Unfortunately, this does not help here because the former text is actually drawn before the latter text. Thus, this strategy also returns the headings in the wrong order.

In such a situation you need a strategy that works differently, e.g. the strategy HorizontalTextExtractionStrategy or HorizontalTextExtractionStrategy2 from this answer (which one depends on your iText version: the former for versions up to iText 5.5.8, the latter for the current development code 5.5.9-SNAPSHOT). Using it you'll get

Part Description Quantity Unit Price Total For Line Extended Price
Landing Fee 1.00 407.84 $ USD 407.84 407.84 $
Parking 1.00 101.96$ USD 101.96 101.96$
??? 1.00 51.65$ USD 51.65 51.65$
Pax Baggage Handling Fee 5.00 8.49$ USD 42.45 42.45 $
Pax Airport Tax 5.00 26.36 $ USD 131.80 131.80$
GA terminal for crew on Arr ferry fit 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Pax on Dep. 5.00 124.00$ USD 620.00 620.00 $
GA terminal for crew on dep. 1.00 125.00$ USD 125.00 125.00$
VIP lounge for Guest on Dep. 1.00 38.00$ USD 38.00 38.00 $
Crew transfer on arr 1.00 70.00 $ USD 70.00 70.00 $
Crew transfer on dep 1.00 70.00 $ USD 70.00 70.00 $
Lavatory Service 1.00 75.00 $ USD 75.00 75.00 $
Catering-ISS 1.00 1,324.28 $ USD 1,324.28 1,324.28 $
Ground Handling 1.00 190.00$ USD 190.00 190.00$
Pax Handling 1.00 190.00$ USD 190.00 190.00$
Push Back 1.00 83.00 $ USD 83.00 83.00 $
Towing 1.00 110.00$ USD 110.00 110.00$

(result of using TextExtraction test method testLocation_text_extraction_test)
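As a rough illustration of how a strategy can tolerate such small vertical offsets, the following stand-alone sketch groups chunks into lines whenever their vertical extents overlap, and only then sorts each line by x. This is not the code of the strategies named above and does not use iText; the Chunk class, the toLines method, and all coordinates are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineBandSketch {
    /** Simplified stand-in for a text chunk with its vertical extent; not an iText class. */
    static final class Chunk {
        final String text;
        final float x;
        final float bottom;
        final float top;
        Chunk(String text, float x, float bottom, float top) {
            this.text = text; this.x = x; this.bottom = bottom; this.top = top;
        }
    }

    /** Merges vertically overlapping chunks into bands, then orders each band by x. */
    static List<String> toLines(List<Chunk> chunks) {
        List<Chunk> byTop = new ArrayList<>(chunks);
        byTop.sort(Comparator.comparingDouble((Chunk c) -> c.top).reversed());

        List<List<Chunk>> bands = new ArrayList<>();
        float bandBottom = 0f;
        for (Chunk c : byTop) {
            if (bands.isEmpty() || c.top < bandBottom) {
                bands.add(new ArrayList<>());   // no overlap with current band: new line
                bandBottom = c.bottom;
            } else {
                bandBottom = Math.min(bandBottom, c.bottom); // overlap: extend the band
            }
            bands.get(bands.size() - 1).add(c);
        }

        List<String> lines = new ArrayList<>();
        for (List<Chunk> band : bands) {
            band.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder line = new StringBuilder();
            for (Chunk c : band) {
                if (line.length() > 0) line.append(' ');
                line.append(c.text);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        // All coordinates below are assumed; only the roughly one-unit offset mirrors the document
        chunks.add(new Chunk("Total For Line Extended Price", 380f, 505.4f, 514.4f));
        chunks.add(new Chunk("Part Description Quantity Unit Price", 40f, 504.4f, 513.4f));
        chunks.add(new Chunk("1.00 407.84 $ USD 407.84 407.84 $", 300f, 489f, 498f));
        chunks.add(new Chunk("Landing Fee", 40f, 488f, 497f));
        for (String line : toLines(chunks)) {
            System.out.println(line);
        }
    }
}
```

With these assumed extents, each pair of slightly offset chunks overlaps vertically, so each pair ends up on one line in left-to-right order.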

Unfortunately, though, these strategies fail if there are overlapping lines in different side-by-side columns, e.g. in your document the invoice recipient address and the information to its right.

You might either try to tweak the horizontal strategies (e.g. by also analyzing horizontal gaps separating columns) or try a combined approach, using the output of multiple strategies for the same document.
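One possible starting point for the gap analysis is sketched below, again without iText and under assumptions: each chunk is reduced to a hypothetical {left, right} pair of x coordinates, and any uncovered stretch at least minGap wide is reported as a candidate column separator. The method name, the threshold, and the sample coordinates are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnGapSketch {
    /**
     * Returns the midpoints of horizontal gaps that are at least minGap wide.
     * Each entry of extents is a {left, right} pair of x coordinates of one chunk.
     */
    static List<Float> findColumnGaps(float[][] extents, float minGap) {
        List<Float> gaps = new ArrayList<>();
        if (extents.length == 0) {
            return gaps;
        }
        float[][] sorted = extents.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((float[] e) -> e[0]));

        float coveredRight = sorted[0][1];   // right edge of the area covered so far
        for (int i = 1; i < sorted.length; i++) {
            float left = sorted[i][0];
            if (left - coveredRight >= minGap) {
                gaps.add((coveredRight + left) / 2f);  // candidate column separator
            }
            coveredRight = Math.max(coveredRight, sorted[i][1]);
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Assumed extents: two address lines on the left, two info lines on the right
        float[][] extents = {
            {40f, 200f}, {40f, 180f}, {320f, 500f}, {330f, 480f}
        };
        System.out.println(findColumnGaps(extents, 50f));
    }
}
```

Chunks could then be split at the reported x positions and each side run through a horizontal strategy separately. Whether a fixed threshold is adequate depends on the document layout.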
