How to extract formatted text from PDF in C#


Problem description

Hello Experts,
I am developing a web-based application through which users will upload their PDF documents. I need to extract several details from each PDF and, after analysing the data, show the result on a web page. I have googled a lot and found several articles that helped me extract text using iTextSharp, PDFBox, and others, along with many similar questions asked on CodeProject and Stack Overflow.
Somehow I got the text page by page, but it was not formatted, so I could not perform operations on the data extracted from the PDF. Is there any way to extract the text line by line and column by column?

Thank you

Recommended answer

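using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public string ReadPdfFile(string path)
{
    // Collect the text of every page; StringBuilder avoids repeated string copies.
    StringBuilder text = new StringBuilder();

    PdfReader pdfReader = new PdfReader(path);
    try
    {
        // iTextSharp page numbers start at 1.
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Use a fresh strategy for each page so earlier pages are not repeated.
            // LocationTextExtractionStrategy (same namespace) orders text by position
            // and may keep the line layout better than SimpleTextExtractionStrategy.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

            // AppendLine keeps the last line of one page apart from the first line of the next.
            text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
        }
    }
    finally
    {
        // Release the file handle even if extraction throws.
        pdfReader.Close();
    }

    // The caller can assign the returned text to a control such as txtInput.Text.
    return text.ToString();
}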
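
The question asks for the text line by line and column by column, while the method above returns one string. As a minimal sketch of one possible post-processing step, the helper below splits that string into rows and cells. The names ExtractedTextParser, SplitIntoRows, and columnGapPattern are hypothetical and not part of iTextSharp. It assumes that lines arrive separated by newline characters and that column gaps survive extraction as tabs or runs of two or more whitespace characters. Whether that holds depends on the PDF and on the extraction strategy, so the pattern is left as a parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of iTextSharp): turns extracted page text
// into rows (lines) and cells (columns).
public static class ExtractedTextParser
{
    // Assumption: a column gap is a tab or a run of two or more whitespace
    // characters. Pass a different pattern if the PDF uses another layout.
    public static List<string[]> SplitIntoRows(
        string pageText, string columnGapPattern = @"\s{2,}|\t")
    {
        var rows = new List<string[]>();
        if (string.IsNullOrEmpty(pageText))
        {
            return rows;
        }

        // Accept both Unix and Windows line endings and drop empty lines.
        string[] lines = pageText.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            if (trimmed.Length == 0)
            {
                continue; // skip lines that contain only whitespace
            }

            // Each row becomes an array of cells split on the column gap.
            rows.Add(Regex.Split(trimmed, columnGapPattern));
        }

        return rows;
    }
}
```

A caller could pass the result of ReadPdfFile to ExtractedTextParser.SplitIntoRows and then read rows[i][j] as the cell in line i, column j. If the extracted text separates columns with only a single space, this pattern will not find them, and a position-based approach would be needed instead.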

