Which text extraction strategy is the right one?


Question

In my C# code I am extracting text from a PDF, and I have two methods of doing it. However, one method works for one type of PDF document and the other method works for the other type.

When method 1 fails, I get the text but without any whitespace, and when method 2 fails, I get only \r\n.

Method 1 (class from http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET)

PDFParser pdf_parser = new PDFParser();
currentText = pdf_parser.ExtractTextFromPDFBytes(pdfReader.GetPageContent(page)) + " ";



Method 2

StringWriter output = new StringWriter();
for (int i = 1; i <= reader.NumberOfPages; i++)
    output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
currentText = output.ToString();



Is there a way to combine both functions so that extraction always works?

Recommended answer

Concerning Method 1: The PDFParser class from that CodeProject article only works in special situations.

It assumes that all the text content is contained in the immediate page content stream(s). In fact, these streams may reference resources (for example, form XObjects) which themselves contain text. This is especially common in n-up documents, but it can happen in any document.

Furthermore, it assumes some Latin-1-like character encoding. This is often (but only often!) the case for text in European languages, but for many Asian languages it hardly ever produces anything sensible.

Additionally, it interprets all kerning gaps as space characters.

Concerning Method 2: As mentioned in a comment on your earlier question, "How to extract text from a PDF and decode characters?", you might want to have a look at this answer to a similar problem.

Essentially, the reason for such missing space characters is that a space you see in the rendered PDF does not necessarily correspond to a space character in the page content description of the PDF. Instead, you often find an operation which, after rendering one word, moves the current position slightly to the right before rendering the next word.

Unfortunately, the same mechanism is also used to improve the appearance of adjacent glyphs: for some letter combinations, the glyphs look and read better when printed nearer to each other or farther from each other than they would be by default. In PDFs this is done using the same operation as above.

Thus, in such situations a PDF parser has to use heuristics to decide whether a shift was meant to imply a space character or merely to make the letter group look good. And heuristics can fail.
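To make the heuristic concrete, here is a minimal, self-contained sketch of the kind of decision described above. It is not the code of iTextSharp or of the CodeProject parser: the TextChunk type, the JoinLine method and the gapFactor parameter are hypothetical names introduced only for this illustration, and the sketch assumes the chunks of a single line are already sorted from left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical type for this sketch: one run of text drawn without repositioning.
public sealed class TextChunk
{
    public string Text { get; }
    public double StartX { get; }   // x coordinate where the chunk begins
    public double EndX { get; }     // x coordinate where the chunk ends

    public TextChunk(string text, double startX, double endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapHeuristic
{
    // Joins the chunks of one line, inserting a space wherever the horizontal
    // gap between two neighbours exceeds gapFactor * spaceWidth.
    // The threshold is an assumption of this sketch, not a value from any library.
    public static string JoinLine(IEnumerable<TextChunk> chunks, double spaceWidth, double gapFactor = 0.5)
    {
        var ordered = new List<TextChunk>(chunks);
        var result = new StringBuilder();

        for (int i = 0; i < ordered.Count; i++)
        {
            if (i > 0)
            {
                TextChunk previous = ordered[i - 1];
                TextChunk current = ordered[i];
                double gap = current.StartX - previous.EndX;

                // Do not double up if the PDF already contains a real space character.
                bool alreadySpaced =
                    previous.Text.EndsWith(" ", StringComparison.Ordinal) ||
                    current.Text.StartsWith(" ", StringComparison.Ordinal);

                if (gap > gapFactor * spaceWidth && !alreadySpaced)
                    result.Append(' ');
            }
            result.Append(ordered[i].Text);
        }
        return result.ToString();
    }
}
```

For the chunks "Hel" (0 to 15), "lo" (15.4 to 25) and "World" (28 to 55) with a space width of 2.5, the default factor yields "Hello World". A factor of 0.1 turns the small kerning gap into a space ("Hel lo World"), and a factor of 2 swallows the real word gap ("HelloWorld"), which is exactly the way such heuristics fail.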

The referenced answer shows how to tweak these heuristics, and the original poster of that question accordingly found a good solution for parsing his PDFs successfully.

If you want a definitive solution to your problem, you had better supply sample PDFs in which you observed the issue.
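Until such samples are available, a crude way to combine the two methods is to check each result for the failure symptoms described in the question and fall back to the next extractor when they appear. The following is only a sketch under stated assumptions: ExtractionFallback, LooksPlausible, ExtractWithFallback and the length limit of 40 are hypothetical names and values made up for this illustration, and since the check is itself a heuristic it cannot guarantee that extraction "always works".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExtractionFallback
{
    // Assumption of this sketch: a whitespace-free run longer than this
    // is treated as text whose word gaps were lost.
    private const int MaxSingleWordLength = 40;

    // True when the text shows neither failure symptom from the question:
    // only line breaks (method 2 failing) or no whitespace at all (method 1 failing).
    public static bool LooksPlausible(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return false;

        string trimmed = text.Trim();
        bool hasInnerWhitespace = trimmed.Any(char.IsWhiteSpace);
        return hasInnerWhitespace || trimmed.Length <= MaxSingleWordLength;
    }

    // Runs the extractors in order and returns the first plausible result.
    // If none is plausible, returns the last non-blank result, or an empty string.
    public static string ExtractWithFallback(IEnumerable<Func<string>> extractors)
    {
        string lastNonBlank = string.Empty;
        foreach (Func<string> extract in extractors)
        {
            string text = extract();
            if (LooksPlausible(text))
                return text;
            if (!string.IsNullOrWhiteSpace(text))
                lastNonBlank = text;
        }
        return lastNonBlank;
    }
}
```

In the question's setting, the delegates would wrap the two existing calls for one page, for example the PdfTextExtractor.GetTextFromPage call of method 2 first and the pdf_parser.ExtractTextFromPDFBytes call of method 1 second. The limitations of method 1 listed above (resources outside the page content stream, encoding assumptions, kerning gaps read as spaces) still apply whenever the fallback is taken.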
