OCR To Text File in C#.net

Views: 240 Published: 2019/6/13 14:58:20 Tags: C#3.0, C#

Question

How can I extract data from a PDF file into a text file in C#.net? The file contents are in optical character recognition (OCR) format.


Solution

I'm not sure whether this will help you, but I found the following on another website. Also try the links provided below.

This extracts all words from a PDF document, for those who have Adobe Acrobat installed, such as the Standard edition of version 9 (it may work with earlier versions, but I have not tested against them).

Here are a couple of C# members that will compile, provided you have added a reference to acrobat.dll to your project and added using Acrobat; to your class file:

// Place these using directives at the top of the file; the two methods below go inside your own class.
using System;
using System.Reflection;
using System.Text;
using Acrobat;

// The following allows word extraction by PDF file spec.
// Opening the PDF document is rather crude and needs to be more robust.
 public static string getTextFromPDF(string filespec)
 {
  AcroAppClass gAppClass = new AcroAppClass();
  AcroAVDoc avDoc = (AcroAVDoc)gAppClass.GetInterface("Acrobat.AcroAVDoc"); // visible PDF document with a UI window
  avDoc.Open(System.IO.Path.GetFullPath(filespec), System.IO.Path.GetFileName(filespec));

  AcroPDDoc doc = (AcroPDDoc)avDoc.GetPDDoc();
  string txt = PdDocGetText(doc);
  doc.Close();
  avDoc.Close(1);
  gAppClass.Exit();
  return txt;
 }

// Slightly modified version of a post in the Adobe forum, originally by Eldrarak82.
 private static string PdDocGetText(AcroPDDoc pdDoc)
 {
  int pages = pdDoc.GetNumPages();
  StringBuilder pageText = new StringBuilder();
  for (int i = 0; i < pages; i++)
  {
   try
   {
    // The Acrobat JavaScript object is late-bound, so its methods are called via reflection.
    object jso = pdDoc.GetJSObject();
    if (jso != null)
    {
     object[] args = new object[] { i };
     object jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
     int numWords = Convert.ToInt32(jsNumWords);
     // Word indexes are zero-based, so the last valid index is numWords - 1.
     for (int j = 0; j < numWords; j++)
     {
      // The third argument (false) keeps trailing whitespace and punctuation on each word.
      object[] argsj = new object[] { i, j, false };
      object jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
      pageText.Append((string)jsWord);
     }
    }
   }
   catch
   {
    // Extraction failed for this page; skip it and continue with the next one.
   }
  }
  return pageText.ToString();
 }

The above code sample has yet to be fully tested and may need improvement. Nonetheless, it is a good starting point.
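Since the question asks for the text to end up in a text file, the remaining step can be sketched as follows. This is only a minimal sketch: PdfTextExport and SaveText are hypothetical names that do not appear in the original answer, and the extractor is passed in as a delegate so that getTextFromPDF (or any other "PDF path to string" method) can be plugged in.

```csharp
using System;
using System.IO;

public static class PdfTextExport
{
    // Hypothetical helper (not part of the original answer): runs any
    // "PDF path -> text" extractor and writes the result to a text file.
    public static void SaveText(string pdfPath, string txtPath, Func<string, string> extractText)
    {
        if (extractText == null) throw new ArgumentNullException("extractText");

        // If the extractor returns null, write an empty file instead of failing.
        string text = extractText(pdfPath) ?? string.Empty;

        // File.WriteAllText creates or overwrites the file, using UTF-8 without a byte order mark.
        File.WriteAllText(txtPath, text);
    }
}
```

Assuming getTextFromPDF is accessible from the call site, a call might look like PdfTextExport.SaveText(@"C:\docs\scan.pdf", @"C:\docs\scan.txt", getTextFromPDF); where both paths are placeholders.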

For those interested in tables, rows, and columns, look up Adobe's documents, such as

http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

around pages 130 to 136.

The same guide may also be helpful for many other tasks.

Hello,

Here are some articles that should get you started:

Converting PDF to Text in C#
PDF2Text Pilot (.NET)
Using iTextSharp to Extract Text from PDF files
iTextSharp - few C# examples.
Extract Text from PDF in C# (100% .NET)

Regards,
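The iTextSharp articles listed above are not reproduced on this page, so the following is only a minimal sketch of the kind of extraction they describe, not their actual code. It assumes the project references iTextSharp 5.x; ITextSharpExtractor and ExtractText are illustrative names. Note that this reads the text layer already present in the PDF (for example, one added by an OCR tool); it does not perform OCR on page images itself.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ITextSharpExtractor
{
    // Illustrative helper: returns the text of every page, one page after another.
    public static string ExtractText(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            StringBuilder text = new StringBuilder();

            // iTextSharp numbers pages starting from 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page keeps text from one page out of the next.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }

            return text.ToString();
        }
        finally
        {
            // Release the underlying file handle even if extraction throws.
            reader.Close();
        }
    }
}
```

The returned string can then be written to disk with System.IO.File.WriteAllText. Unlike the Acrobat approach, this does not require Adobe Acrobat to be installed.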
