Extracting text from PDFs in C#

Question

Put simply, I need to rip text out of multiple PDFs (quite a lot, actually) in order to analyse the contents before storing them in an SQL database.

I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors, some characters are scrambled, and a lot of the time there are spaces (' ') EVERYWHERE: inside words, between every letter, and in huge blocks taking up several lines. It all seems a bit random.

Is there an easy way of doing this that I'm completely overlooking (quite likely!), or is it an arduous task that involves reliably converting the extracted byte values into letters?

Recommended answer

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format that attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as two separate calls, as shown in this pseudo-code:

moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")

This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing some micro-space between characters to get a fully justified line. The end result is that the text fragments actually stored in a PDF are very often not full words, but pieces of them.
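One way to cope with such fragments is to rejoin them by position rather than trusting the order or boundaries in which they were written. The following is only a minimal sketch of that idea, not how iTextSharp or any other library does it: TextFragment, FragmentMerger and both threshold parameters are hypothetical names invented for illustration, and it assumes a parser has already supplied each fragment's x/y position and width in the same units. It also assumes C# 9 or later for the record syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one positioned run of text, as a PDF parser might report it.
public record TextFragment(double X, double Y, double Width, string Text);

public static class FragmentMerger
{
    // Rebuilds lines of text from positioned fragments. A space is inserted
    // only when the horizontal gap between neighbouring fragments is larger
    // than spaceThreshold. Both thresholds are illustrative assumptions; in
    // practice they would need to scale with the font size.
    public static string Merge(IEnumerable<TextFragment> fragments,
                               double spaceThreshold = 1.0,
                               double lineTolerance = 0.5)
    {
        // Group fragments into lines. PDF y coordinates grow upwards,
        // so the topmost line has the largest Y.
        var lines = new List<List<TextFragment>>();
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= lineTolerance)
                current.Add(fragment);
            else
                lines.Add(new List<TextFragment> { fragment });
        }

        // Within each line, read left to right and decide where spaces go.
        var result = new List<string>();
        foreach (var line in lines)
        {
            var sb = new StringBuilder();
            double? previousEnd = null;
            foreach (var fragment in line.OrderBy(f => f.X))
            {
                if (previousEnd.HasValue && fragment.X - previousEnd.Value > spaceThreshold)
                    sb.Append(' ');
                sb.Append(fragment.Text);
                previousEnd = fragment.X + fragment.Width;
            }
            result.Add(sb.ToString());
        }
        return string.Join("\n", result);
    }
}
```

For the example above, passing the fragment "T" at x = 10 with width 5.5 and the fragment "ap" at x = 15.2 on the same baseline gives the single word "Tap", because the kerned gap between them is below the threshold. A fragment that starts several units after the previous one ends would be separated by a space instead. Choosing the threshold is the hard part, which is one reason extracted text so often ends up with missing or spurious spaces.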
