Extract text of PDF without tool


Question

Currently I'm extracting the text of PDFs with the iTextSharp library (in VB.NET). I'd like to be independent of other tools and libraries, as I can't ship them to others along with my program.

Is there a solution (no .dll etc.) in any programming language to quickly extract the text of a PDF?

Recommended answer

Short answer:

Of course there is a way of doing this. iText (alongside many other PDF libraries) is capable of doing it, so an algorithm for extracting text does exist.

Long answer:

PDF is not a WYSIWYG format. A PDF document is sort of an ungodly marriage between "objects that reference each other" and "programming language".

Let me explain. A PDF document has a graphics state. So whenever you see text in a PDF document (in a viewer like Adobe Reader), you are essentially seeing the result of some 'code' in the PDF document that says:

Go to position 50, 720
Set the active font to Helvetica, font size 12
Set the active drawing color to black
Draw the glyph that corresponds to the character 'H'
Go to position 53, 720
Draw the glyph that corresponds to the character 'e'
etc.
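In an actual content stream, those steps are written as operators such as `Tf` (set font), `Td` (move the text position) and `Tj` (show a string). The sketch below is only an illustration of how such 'code' could be read back. It assumes an uncompressed stream, literal `(...)` strings without nested parentheses, and a font whose character codes map directly to Latin-1; real files frequently break all three assumptions. The sample stream and the name `show_text_operands` are made up for this example.

```python
import re

# A hand-written, uncompressed content stream (hypothetical sample).
SAMPLE_STREAM = b"""
BT
/F1 12 Tf
0 g
50 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET
"""

# Matches a literal string followed by the Tj operator.
# Handles backslash escapes, but not nested parentheses or TJ arrays.
_TJ = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def show_text_operands(stream: bytes) -> list[str]:
    """Return the string operand of every Tj operator, in stream order."""
    results = []
    for match in _TJ.finditer(stream):
        raw = match.group(1)
        # Undo the simple escapes \( \) \\ ; other escapes are left as-is.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        results.append(raw.decode("latin-1"))
    return results


print(show_text_operands(SAMPLE_STREAM))
```

Note that this returns strings in the order they appear in the stream, which is not necessarily the order in which they appear on the page.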

Instructions and resources (like fonts, images, vector graphics) can be grouped together in objects.

Each object is assigned a number, and is listed explicitly in the cross-reference table (at the end of the PDF document).
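As a rough sketch of what reading that table involves, the Python below locates the `startxref` keyword and parses a classic cross-reference table into a mapping from object number to byte offset. It is deliberately limited: it assumes a single classic table and does not handle cross-reference streams, incremental updates, or damaged files. The helper names `build_sample_pdf` and `read_xref` are hypothetical, and the sample bytes are a stripped-down stand-in for a real file.

```python
def build_sample_pdf() -> bytes:
    """Assemble a tiny PDF-like byte string with a correct classic xref table."""
    data = b"%PDF-1.4\n"
    bodies = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    offsets = []
    for body in bodies:
        offsets.append(len(data))  # byte position where this object starts
        data += body
    xref_pos = len(data)
    data += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        data += b"%010d 00000 n \n" % off
    data += b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    data += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return data


def read_xref(data: bytes) -> dict[int, int]:
    """Map object number -> byte offset for in-use entries of a classic xref table."""
    # The 'startxref' keyword near the end of the file gives the table's offset.
    marker = data.rfind(b"startxref")
    if marker < 0:
        raise ValueError("no startxref keyword found")
    xref_pos = int(data[marker + len(b"startxref"):].split()[0])
    lines = data[xref_pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table (could be an xref stream)")
    table = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        first, count = map(int, lines[i].split())  # subsection header
        for n in range(count):
            offset, _gen, kind = lines[i + 1 + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                table[first + n] = int(offset)
        i += 1 + count
    return table


pdf = build_sample_pdf()
for number, offset in read_xref(pdf).items():
    print(number, offset, pdf[offset:offset + 7])
```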

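So, in order to read the text from a PDF document you would need to:

  1. Read the XREF table
  2. Figure out where the /Page objects start (byte position)
  3. Parse each /Page object and all its children (using the XREF table again to determine where each of these children is in the file)
  4. Resolve the geometry instructions (the graphics state does not need to flow in the same direction as the text)
  5. Sort all visible characters (compare background and foreground color, and check whether they are obscured by other objects such as images) according to the direction in which you expect the text to be written
  6. Build the return string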
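To give a feel for the last two steps only, here is a minimal Python sketch. It assumes the earlier steps have already produced a list of glyphs with resolved page coordinates, that the text runs horizontally from left to right, and that every glyph is visible. The `Glyph` type, `build_text`, and the `line_tolerance` parameter are invented for this illustration and do not come from any library.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    x: float
    y: float
    char: str


def build_text(glyphs: list[Glyph], line_tolerance: float = 2.0) -> str:
    """Group glyphs into lines by y, then order each line left to right."""
    lines: list[list[Glyph]] = []
    # PDF user space has y growing upward, so the top line has the largest y.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return "\n".join(
        "".join(g.char for g in sorted(line, key=lambda g: g.x))
        for line in lines
    )


# Glyphs deliberately listed out of reading order.
glyphs = [
    Glyph(53, 720, "e"), Glyph(50, 720, "H"), Glyph(50, 706, "W"),
    Glyph(56, 720, "y"), Glyph(56, 706, "o"),
]
print(build_text(glyphs))
```

A real extractor would also need to infer word spaces from the gaps between glyphs and cope with rotated or right-to-left text, which this sketch ignores.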

And that is probably why other people use libraries. Don't get me wrong, I'm a huge fan of doing it yourself (it's the best way to gain a deep knowledge of how certain things work).

But look at it from the point of view of one of your users. What would you trust more?

  • A program that uses 'self-written' code to handle PDF documents (total experience in parsing PDF documents < 1 year),
  • or a program that simply calls a PDF library (total experience in parsing PDF documents > 20 years)
