OCR To Text File in C#.net
Problem description
How can I extract data from a PDF file into a text file in C#.NET? The file contents are in Optical Character Recognition (OCR) format.
Solution
I'm not sure whether this will help you. I found this on another website; also try the links provided below.
To extract all words from a PDF document, for those with Adobe Acrobat installed, such as Standard version 9 (it may work with earlier versions, but I have not tested against them), here are a couple of C# members that will compile provided you have added a reference to acrobat.dll to your project and added using Acrobat; to your class:
    // Requires: using System; using System.Collections.Generic; using Acrobat;

    // The following allows word extraction by PDF file spec.
    // Opening the PDF document is rather crude and needs to be more robust.
    public static string getTextFromPDF(string filespec)
    {
        Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
        // Visible PDF document with a UI window
        Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc");
        avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));
        AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
        string txt = PdDocGetText(doc);
        doc.Close();
        avDoc.Close(1);
        gAppClass.Exit();
        return txt;
    }

    // Slightly modified version of a post in the Adobe forum, originally by Eldrarak82
    private static string PdDocGetText(AcroPDDoc pdDoc)
    {
        int pages = pdDoc.GetNumPages();
        string pageText = "";
        for (int i = 0; i < pages; i++)
        {
            object jso, jsNumWords, jsWord;
            List<string> words = new List<string>();
            try
            {
                jso = pdDoc.GetJSObject();
                if (jso != null)
                {
                    // Ask the Acrobat JavaScript object how many words page i holds
                    object[] args = new object[] { i };
                    jsNumWords = jso.GetType().InvokeMember("getPageNumWords", System.Reflection.BindingFlags.InvokeMethod, null, jso, args, null);
                    int numWords = Int32.Parse(jsNumWords.ToString());
                    // Word indices are zero-based, so stop before numWords
                    // (the original used <=, which asks for one word too many)
                    for (int j = 0; j < numWords; j++)
                    {
                        // false: do not strip punctuation and whitespace from the word
                        object[] argsj = new object[] { i, j, false };
                        jsWord = jso.GetType().InvokeMember("getPageNthWord", System.Reflection.BindingFlags.InvokeMethod, null, jso, argsj, null);
                        words.Add((string)jsWord);
                    }
                }
                foreach (string word in words)
                {
                    pageText += word;
                }
            }
            catch
            {
                // Any failure on a page is swallowed and that page's text is skipped
            }
        }
        return pageText;
    }
The above code sample has yet to be fully tested and may need improvement. Nonetheless, it is a good starting point.
For those interested in tables, rows and columns, look up Adobe's documents, such as the plugin apps developer guide, around pages 130 to 136:
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf
That guide may also be helpful for many other tasks.
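The question asks for a text file, while getTextFromPDF only returns a string, so one more step is needed to save the result. The following is a minimal sketch of that step, not part of the original post: the class name PdfTextSaver, the method SaveAsTextFile and the choice of writing a UTF-8 .txt file next to the PDF are all assumptions made for illustration. The extractor is passed in as a delegate so the sketch does not depend on Acrobat being installed.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not from the original answer).
public static class PdfTextSaver
{
    // Runs extractText on the PDF path and writes the returned text to a
    // .txt file with the same folder and base name as the PDF.
    // Returns the full path of the text file that was written.
    public static string SaveAsTextFile(string pdfPath, Func<string, string> extractText)
    {
        if (extractText == null) throw new ArgumentNullException(nameof(extractText));

        string fullPath = Path.GetFullPath(pdfPath);

        // Treat a null result from the extractor as empty text.
        string text = extractText(fullPath) ?? string.Empty;

        // Assumption: the output goes next to the PDF, e.g. scan.pdf -> scan.txt.
        string txtPath = Path.ChangeExtension(fullPath, ".txt");
        File.WriteAllText(txtPath, text, Encoding.UTF8);
        return txtPath;
    }
}
```

From inside the class that holds getTextFromPDF, a call might look like PdfTextSaver.SaveAsTextFile(@"C:\docs\scan.pdf", getTextFromPDF); where the path is only a placeholder. Note that this saves whatever text the extractor finds; if the scanned PDF has no recognized text layer, the file will be empty.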
Hello,
Here are some articles which should get you started.
- Converting PDF to Text in C#
- PDF2Text Pilot (.NET)
- Using iTextSharp to Extract Text from PDF files
- iTextSharp - few C# examples.
- Extract Text from PDF in C# (100% .NET)
Regards,