Extracting text from PDFs in C#

Question

Pretty simply, I need to rip text out of multiple PDFs (quite a lot actually) in order to analyse the contents before sticking it in an SQL database.

I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors, some characters are scrambled, and a lot of the time there are spaces (' ') EVERYWHERE: inside words, between every letter, huge blocks of them taking up several lines. It all seems a bit random.

Is there any easy way of doing this that I'm completely overlooking (quite likely!) or is it a bit of an arduous task that involves converting the extracted byte values into letters reliably?

Cheers,

Duncan

Recommended answer

You may take a look at this article. It's based on the excellent iTextSharp library.
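For reference, a minimal sketch of page-by-page extraction with iTextSharp might look like the following. It assumes iTextSharp 5.x and is not taken from the linked article; the class name `PdfTextSketch` and method name `ExtractText` are made up for illustration. `LocationTextExtractionStrategy` orders text chunks by their position on the page, which may help with some of the stray-space problems described in the question, though results still depend heavily on how each PDF was produced.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class, named for illustration only.
public static class PdfTextSketch
{
    // Returns the text of every page in the PDF at 'path', in page order.
    public static string ExtractText(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            var text = new StringBuilder();

            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page so no chunks carry over between pages.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }

            return text.ToString();
        }
        finally
        {
            // Release the underlying file handle even if extraction throws.
            reader.Close();
        }
    }
}
```

Calling `PdfTextSketch.ExtractText(path)` for each file yields one string per PDF, which can then be analysed before it goes into the database.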
