How to extract text from a PDF and decode characters?


Question

I am using iTextSharp to extract text from a PDF document with this code:

public static bool does_document_text_have_keyword(string keyword, 
                       string pdf_src, Report report_object)  // TEST
{
    try
    {
        PdfReader pdfReader = new PdfReader(pdf_src);
        string currentText;
        int count = pdfReader.NumberOfPages;
        for (int page = 1; page <= count; page++)
        {
           ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
           currentText = PdfTextExtractor.GetTextFromPage
                           (pdfReader, page, strategy);
           currentText = Encoding.UTF8.GetString
                           (ASCIIEncoding.Convert
                             (Encoding.Default,                                 
                              Encoding.UTF8, 
                              Encoding.Default.GetBytes(currentText)));

           report_object.log(currentText);  // TEST

           if (currentText.IndexOf
                (keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
        }
        pdfReader.Close();
        return false;
    }
    catch
    {
        return false;
    }
}



The problem is that the extracted text has no whitespace, as if every space had been replaced with an empty string, even though the PDF document itself contains spaces. Does anyone know what is happening here?

Recommended answer

I believe your issue is the SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html:

If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.

Try using the LocationTextExtractionStrategy instead. Its documentation states:

A text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
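
As a rough illustration of that suggestion, here is a minimal sketch of the question's method with the strategy swapped. It assumes the iTextSharp 5.x namespaces (`iTextSharp.text.pdf` and `iTextSharp.text.pdf.parser`). The class and method names (`PdfKeywordSearch`, `DocumentContainsKeyword`) are made up for this example, and the logging through `Report` is left out. The sketch also drops the `Encoding` round trip from the question: `GetTextFromPage` already returns a .NET string, so re-encoding it through `Encoding.Default` should not be needed and can lose characters that the default code page cannot represent.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch  // hypothetical name, for illustration only
{
    // Returns true if any page of the PDF at pdfPath contains the keyword
    // (case-insensitive). Unlike the question's version, exceptions are not
    // swallowed here; that is a choice made for this sketch.
    public static bool DocumentContainsKeyword(string keyword, string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh strategy for each page so that text collected
                // from earlier pages is not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            // Close the reader on every path, including the early return.
            reader.Close();
        }
    }
}
```

Whether the spaces come back depends on how the PDF positions its text, so it is worth logging the extracted text for a page or two before relying on the keyword match.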
