OCR To Text File in C#.net

Problem description

How can I extract the data from a PDF file into a text file in C#.NET? The file contents are in Optical Character Recognition (OCR) format.

Solution

I am not sure whether this will help you.
I found this on another website. Also try the links provided below.


This extracts all words from a PDF document, for those with Adobe Acrobat installed, such as Standard version 9 (it may work with earlier versions, but I have not tested against them).

Here are a couple of C# members that will compile, provided you have added a reference to acrobat.dll to your project and added using Acrobat; to your class:


// Requires a project reference to acrobat.dll and these directives at the top of the file:
//   using System;
//   using System.Collections.Generic;
//   using Acrobat;

// Extracts all words from the PDF at the given file path.
// Opening the PDF document is rather crude and needs to be more robust.
public static string getTextFromPDF(string filespec)
{
    Acrobat.AcroAppClass gAppClass = new Acrobat.AcroAppClass();
    // Visible PDF document with a UI window
    Acrobat.AcroAVDoc avDoc = (Acrobat.AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc");
    avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));

    AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
    string txt = PdDocGetText(doc);
    doc.Close();
    avDoc.Close(1); // 1 = close without saving
    gAppClass.Exit();
    return txt;
}

// Slightly modified version of a post in the Adobe forum, originally by Eldrarak82
private static string PdDocGetText(AcroPDDoc pdDoc)
{
    AcroPDPage page;
    int pages = pdDoc.GetNumPages();
    string pageText = "";
    for (int i = 0; i < pages; i++)
    {
        page = (AcroPDPage)pdDoc.AcquirePage(i);
        object jso, jsNumWords, jsWord;
        List<string> words = new List<string>();
        try
        {
            jso = pdDoc.GetJSObject();
            if (jso != null)
            {
                object[] args = new object[] { i };
                jsNumWords = jso.GetType().InvokeMember("getPageNumWords", System.Reflection.BindingFlags.InvokeMethod, null, jso, args, null);
                int numWords = Int32.Parse(jsNumWords.ToString());
                // Word indices are zero-based, so stop before numWords
                for (int j = 0; j < numWords; j++)
                {
                    // false = keep punctuation and whitespace attached to each word
                    object[] argsj = new object[] { i, j, false };
                    jsWord = jso.GetType().InvokeMember("getPageNthWord", System.Reflection.BindingFlags.InvokeMethod, null, jso, argsj, null);
                    words.Add((string)jsWord);
                }
            }
            foreach (string word in words)
            {
                pageText += word;
            }
        }
        catch
        {
            // Errors are silently ignored; a page that fails contributes no text
        }
    }
    return pageText;
}


The above code sample has yet to be fully tested and may need improvement. Nonetheless, it is a good starting point.
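The question asks for a text file, while getTextFromPDF only returns a string. Here is a minimal sketch of that last step. PdfTextExport and SaveTextNextToPdf are hypothetical names that are not part of the original post, and the extractor is passed in as a delegate so that the sketch compiles without acrobat.dll. Writing the .txt next to the PDF with UTF-8 encoding is also just an assumption.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class PdfTextExport
{
    // Runs the supplied extractor on pdfPath and writes the result to a
    // .txt file with the same name in the same folder. Returns the path
    // of the text file that was written.
    public static string SaveTextNextToPdf(string pdfPath, Func<string, string> extractText)
    {
        if (pdfPath == null) throw new ArgumentNullException("pdfPath");
        if (extractText == null) throw new ArgumentNullException("extractText");

        // An extractor that returns null yields an empty file
        string text = extractText(pdfPath) ?? "";

        // Assumption: same folder, same file name, .txt extension
        string txtPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(txtPath, text, Encoding.UTF8);
        return txtPath;
    }
}
```

From inside the class that holds getTextFromPDF, it could be called as PdfTextExport.SaveTextNextToPdf(@"C:\docs\scan.pdf", getTextFromPDF); the path here is only an example.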

For those interested in tables, rows and columns, look up the documents by Adobe, such as


http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

around pages 130 to 136.

The link

http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

may also be helpful for many other tasks.


Hello,

Here are some articles which should get you started.

Regards,

