Programmatically recognize text from scans in a PDF file


Question

I have a PDF file that contains data we need to import into a database. The files appear to be PDF scans of printed alphanumeric text, in what looks like 10 pt. Times New Roman.

Are there any tools or components that will allow me to recognize and parse this text?

Recommended answer

I've used pdftohtml to successfully strip tables out of PDFs into CSV. It's based on Xpdf, a more general-purpose toolset that also includes pdftotext. I just wrap it in a Process.Start call from C#.
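As a rough illustration of wrapping one of these tools in a Process.Start call, here is a minimal C# sketch for pdftotext. It assumes pdftotext is installed and on the PATH; the class and method names (PdfToText, BuildArguments, Extract) are hypothetical and not part of the original answer. Note that pdftotext only returns text that is actually stored in the PDF, so a pure image scan will produce little or no output.

```csharp
using System;
using System.Diagnostics;

static class PdfToText
{
    // Builds the pdftotext argument string: "-layout" keeps the physical
    // page layout, and "-" as the output file sends the text to stdout.
    public static string BuildArguments(string pdfPath)
    {
        return "-layout \"" + pdfPath + "\" -";
    }

    // Runs pdftotext (assumed to be on the PATH) and returns its output.
    public static string Extract(string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftotext",
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,          // required to redirect output
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read everything before waiting, so a full pipe cannot block the tool.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            return text;
        }
    }
}
```

A call would look like `string text = PdfToText.Extract(@"C:\test\input.pdf");`, where the path is only a placeholder.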

If you're looking for something a little more DIY, there's the iTextSharp library (a port of Java's iText) and PDFBox (yes, it says Java, but there is a .NET version by way of IKVM.NET). There are CodeProject articles on using iTextSharp and PDFBox from C#.

And, if you're really a masochist, you could call into Adobe's PDF IFilter with COM interop. The IFilter spec is pretty simple, but I would guess that the interop overhead would be significant.

After re-reading the question and subsequent answers, it's become clear that the OP is dealing with images in his PDF. In that case, you'll need to extract the images (the PDF libraries above can do that fairly easily) and run them through an OCR engine.
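One way to handle the extraction step without a PDF library is the pdfimages tool that ships in the same Xpdf suite as pdftotext. The C# sketch below is only an illustration under that assumption: pdfimages must be on the PATH, and the names PdfImageExtractor, BuildArguments, and ExtractImages, as well as the "page" file prefix, are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class PdfImageExtractor
{
    // Builds the pdfimages argument string: "-j" asks for JPEG files where
    // the embedded image is already JPEG-encoded; other images are written
    // in the tool's default formats. imageRoot is the output file prefix.
    public static string BuildArguments(string pdfPath, string imageRoot)
    {
        return "-j \"" + pdfPath + "\" \"" + imageRoot + "\"";
    }

    // Extracts every embedded image into outputDir and returns the file paths.
    public static string[] ExtractImages(string pdfPath, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        string imageRoot = Path.Combine(outputDir, "page"); // hypothetical prefix

        var startInfo = new ProcessStartInfo
        {
            FileName = "pdfimages",
            Arguments = BuildArguments(pdfPath, imageRoot),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "pdfimages exited with code " + process.ExitCode);
        }

        // pdfimages names its output <imageRoot>-<number>.<extension>.
        return Directory.GetFiles(outputDir, "page-*");
    }
}
```

Each returned file can then be handed to whichever OCR engine you choose.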

I've used MODI interactively before, with decent results. It's COM, so calling it from C# via interop is also doable and pretty simple. The sample below is VB.NET, but the calls are the same:

' lifted from http://en.wikipedia.org/wiki/Microsoft_Office_Document_Imaging
Dim inputFile As String = "C:\test\multipage.tif"
Dim strRecText As String = ""
Dim Doc1 As MODI.Document

Doc1 = New MODI.Document
Doc1.Create(inputFile)
Doc1.OCR()  ' this will ocr all pages of a multi-page tiff file
Doc1.Save() ' this will save the deskewed reoriented images, and the OCR text, back to the inputFile

For imageCounter As Integer = 0 To (Doc1.Images.Count - 1) ' work your way through each page of results
   strRecText &= Doc1.Images(imageCounter).Layout.Text    ' this puts the ocr results into a string
Next

System.IO.File.AppendAllText("C:\test\testmodi.txt", strRecText) ' write the OCR text out to disk (fully qualified, so no Imports System.IO is needed)

Doc1.Close() ' clean up
Doc1 = Nothing

Others like Tesseract, but I have no direct experience with it. I've heard both good and bad things about it, so I imagine it greatly depends on your source quality.
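For anyone who wants to try Tesseract from C#, the simplest route is probably its command-line program, wrapped the same way as the other tools. This is a minimal sketch, assuming the tesseract executable and its English language data are installed and on the PATH; the names TesseractOcr, BuildArguments, and Recognize are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractOcr
{
    // Builds the argument string "<image> <outputBase> -l eng".
    // Tesseract appends ".txt" to outputBase for its text output.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" -l eng";
    }

    // Runs OCR on a single image file and returns the recognized text.
    public static string Recognize(string imagePath)
    {
        // Write the result to a uniquely named temporary file.
        string outputBase = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

        var startInfo = new ProcessStartInfo
        {
            FileName = "tesseract",
            Arguments = BuildArguments(imagePath, outputBase),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    "tesseract exited with code " + process.ExitCode);
        }

        string textFile = outputBase + ".txt";
        string text = File.ReadAllText(textFile);
        File.Delete(textFile); // clean up the temporary output
        return text;
    }
}
```

As with any OCR engine, expect to validate the output against the source before importing it into a database.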
