Extract PDF text by coordinates

Question

I'd like to know whether there is a PDF library for Microsoft .NET that can extract text from a region given by its coordinates.

For example (in pseudocode):

PdfReader reader = new PdfReader();
reader.Load("file.pdf");

// Top, bottom, left, right in pixels or any other unit
string wholeText = reader.GetText(100, 150, 20, 50);

I've tried to do this with PDFBox for .NET (the build that runs on top of IKVM) with no luck, and it seems to be very outdated and undocumented.

Perhaps someone has a good sample of doing this with PDFBox, iTextSharp, or any other open-source library and can give me a hint.

Thanks.

Recommended answer

Well, thank you all for your effort.

I got it working with Apache's PDFBox compiled for .NET with IKVM. This is the final code:

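// Needs the IKVM-compiled PDFBox assemblies; the calls below match the PDFBox 1.x API:
// using org.apache.pdfbox.pdmodel; using org.apache.pdfbox.util;
PDDocument doc = PDDocument.load(@"c:\invoice.pdf");

PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// java.awt.Rectangle takes (x, y, width, height)
stripper.addRegion("testRegion", new java.awt.Rectangle(0, 10, 100, 100));
// Extract the registered regions from the first page
stripper.extractRegions((PDPage)doc.getDocumentCatalog().getAllPages().get(0));

string text = stripper.getTextForRegion("testRegion");
doc.close();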

It works like a charm.
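The pseudocode in the question passes top, bottom, left, and right edges, while `java.awt.Rectangle` expects x, y, width, and height. A minimal sketch of that conversion follows, in plain Java since `Rectangle` is a Java class (the same constructor is what the C# code above calls through IKVM). The `RegionBounds` class and `fromEdges` helper are hypothetical names introduced here for illustration. The sketch also assumes the edges are measured from the top-left corner of the page with y increasing downward; check that against your PDFBox version before relying on it.

```java
import java.awt.Rectangle;

public class RegionBounds {

    /**
     * Hypothetical helper: converts edge coordinates (top, bottom, left, right)
     * into the (x, y, width, height) form that java.awt.Rectangle expects.
     * Assumes y grows downward from the top of the page.
     */
    public static Rectangle fromEdges(int top, int bottom, int left, int right) {
        if (bottom < top || right < left) {
            throw new IllegalArgumentException(
                "bottom/right must not be smaller than top/left");
        }
        // x = left edge, y = top edge, width and height are the edge differences
        return new Rectangle(left, top, right - left, bottom - top);
    }

    public static void main(String[] args) {
        // The values from the question's pseudocode: top=100, bottom=150, left=20, right=50
        Rectangle region = fromEdges(100, 150, 20, 50);
        System.out.println(region);
    }
}
```

The resulting rectangle could then be passed to `stripper.addRegion(...)` in place of the hard-coded one.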

Thank you anyway; I hope my own answer helps others. If you need further details, leave a comment here and I'll update this answer.
