How to extract text from a PDF and decode characters?

Views: 702 Published: 2016/10/10 20:28:32 Tags: c# pdf itextsharp

Question

I am using iTextSharp to extract text from a PDF document with this code:

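using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Returns true if any page of the PDF contains the keyword (case-insensitive).
// Report is the poster's own logging type; it is not shown in the question.
public static bool does_document_text_have_keyword(string keyword,
                       string pdf_src, Report report_object)  // TEST
{
    try
    {
        PdfReader pdfReader = new PdfReader(pdf_src);
        try
        {
            int count = pdfReader.NumberOfPages;
            for (int page = 1; page <= count; page++)  // pages are 1-based
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage
                                       (pdfReader, page, strategy);
                // Round-trip the extracted string through the default code page and UTF-8.
                currentText = Encoding.UTF8.GetString
                                (Encoding.Convert
                                  (Encoding.Default,
                                   Encoding.UTF8,
                                   Encoding.Default.GetBytes(currentText)));

                report_object.log(currentText);  // TEST

                if (currentText.IndexOf
                     (keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
            }
            return false;
        }
        finally
        {
            pdfReader.Close();  // also runs on the early return above
        }
    }
    catch
    {
        // Any failure (missing file, unreadable PDF) is reported as "not found".
        return false;
    }
}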

The problem is that the extracted text has no whitespace: it is as if every space had been replaced with an empty string. The PDF document itself does contain spaces. Does anyone know what is happening here?
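A minimal sketch of a variant that may be worth trying. It rests on two assumptions that the question does not confirm. First, `PdfTextExtractor.GetTextFromPage` already returns a .NET string, so the round trip through `Encoding.Default` cannot recover anything; at most it replaces characters that the default code page lacks, and the sketch drops it. Second, the sketch swaps in `LocationTextExtractionStrategy`, which orders text chunks by position and inserts a space where it detects a horizontal gap. Whether that restores the missing spaces depends on how this particular PDF positions its glyphs. The class and method names are hypothetical, and the logging call is left out.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch
{
    // Hypothetical helper, not from the original post:
    // returns true if any page of the PDF contains the keyword (case-insensitive).
    public static bool DocumentContainsKeyword(string keyword, string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumption: a position-based strategy may recover word gaps
                // that the simple strategy misses for this document.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();

                // The result is already a .NET string; no re-encoding is applied.
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true;
            }
            return false;
        }
        finally
        {
            reader.Close();  // release the file even on the early return
        }
    }
}
```

If the output still has no spaces, comparing the logged text from both strategies on a single page should show whether the extractor is detecting any gaps between words at all.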
