Extract footer data of PDF in Java


Question

I am able to get the data from PDF pages as a string, but the footer data is extracted along with it. I want to remove the footer from all pages of the PDF. How can I do that? I tried Rectangle2D, but the coordinates I used return no data.

Recommended answer

In a comment, the OP indicated that he used this code:

PDDocument doc = PDDocument.load("xyz.pdf");
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 1 );
Rectangle2D region = new Rectangle2D.Double(10, 10, 10, 10);
String regionName = "region";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
System.out.println("Region is "+ stripper.getTextForRegion("region"));

For most documents this code extracts no text, because it only looks at a small (10x10 pt) area near the upper left corner of the second page of the document. Thus, the values in new Rectangle2D.Double(10, 10, 10, 10) have to change.
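To make the point concrete, here is a minimal sketch that uses only java.awt.geom and no PDF library. It assumes that a text position counts as inside a region when its (x, y) point, measured from the upper left corner of the page, is contained in the rectangle. The class name RegionCheck, the helper covers, and the sample coordinates (body text starting at x = 72, y = 84, i.e. roughly a 1-inch margin) are hypothetical and only serve as illustration.

```java
import java.awt.geom.Rectangle2D;

public class RegionCheck {
    /** True if a text position (x, y measured from the upper left corner) lies inside the region. */
    static boolean covers(Rectangle2D region, double x, double yFromTop) {
        return region.contains(x, yFromTop);
    }

    public static void main(String[] args) {
        // The OP's region: spans x 10..20 and y 10..20 only.
        Rectangle2D tiny = new Rectangle2D.Double(10, 10, 10, 10);
        // Hypothetical start of body text on a page with roughly 1-inch (72 pt) margins.
        double x = 72, y = 84;
        System.out.println("tiny region covers text start: " + covers(tiny, x, y));

        // A hypothetical full-width band that starts 72 pt below the top edge.
        Rectangle2D wide = new Rectangle2D.Double(0, 72, 612, 648);
        System.out.println("wide region covers text start: " + covers(wide, x, y));
    }
}
```

Under these assumptions the tiny region misses the text start while the wide band contains it, which is why a region has to be sized to the actual layout of the page.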

In a further comment, the OP wrote:

"I tried with various regions, yet I am not getting any text. If you have an idea for a normal PDF page, you should share."

There is no such thing as a normal PDF page. The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. There is no serious restriction on page dimensions or on the location of content on pages.

For example, for this form, you need values like these

PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
Rectangle2D region = new Rectangle2D.Float(0f, 230f, 612f, 300f);

to extract the body "I authorize any health plan ... I have received a copy of this authorization." without headers, footers, or form lines.
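One way to arrive at such values is to measure the height of the header band and of the footer band once and derive the region from them. The following is a minimal sketch using only java.awt.geom. The class name BodyRegion, the helper between, and the measurements in main are hypothetical. In particular, the 792 pt page height (US Letter) and the 262 pt footer band are assumptions chosen so that the result matches the values used above. They are not stated in the original answer.

```java
import java.awt.geom.Rectangle2D;

public class BodyRegion {
    /**
     * Builds the region between a header band and a footer band.
     * All values are in points; y runs downward from the top of the page.
     */
    static Rectangle2D between(double pageWidth, double pageHeight,
                               double headerHeight, double footerHeight) {
        double bodyHeight = pageHeight - headerHeight - footerHeight;
        if (bodyHeight <= 0) {
            throw new IllegalArgumentException("header and footer leave no room for a body");
        }
        // Full page width, starting right below the header band.
        return new Rectangle2D.Double(0, headerHeight, pageWidth, bodyHeight);
    }

    public static void main(String[] args) {
        // Assumed: 612 x 792 pt page, 230 pt header band, 262 pt footer band.
        Rectangle2D region = between(612, 792, 230, 262);
        System.out.println(region.getX() + ", " + region.getY() + ", "
                + region.getWidth() + ", " + region.getHeight());
    }
}
```

With these assumed measurements the helper yields a region at x = 0, y = 230 with width 612 and height 300, i.e. the same numbers as in new Rectangle2D.Float(0f, 230f, 612f, 300f). The resulting rectangle could then be passed to stripper.addRegion in place of the 10x10 one.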

If you have many similar pages (e.g. one large document with many pages that share a similar layout), you only have to measure once and can then use the same values for every page you extract.
