How to extract text with iTextSharp 4.1.6?

Question

iTextSharp 4.1.6 is the last version licensed under the LGPL and can be used for commercial purposes without paying license fees.

I, and probably others, would be interested in how to extract text with this version.

Does anyone have an idea?

Recommended answer

I had to hack this together manually, as I was in the same boat as you. Hopefully this will help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter holding the path to the PDF file.

// Requires: using System.Text; using iTextSharp.text.pdf;
var reader = new PdfReader(fileName);

StringBuilder sb = new StringBuilder();

try
{
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
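        var cpage = reader.GetPageN(page);
        var content = cpage.Get(PdfName.CONTENTS);

        // Resolve /Contents whether it is stored directly or as an indirect reference.
        // Note: pages whose /Contents is an array of streams are skipped here.
        var value = PdfReader.GetPdfObject(content);

        if (value != null && value.IsStream())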
        {
            PRStream stream = (PRStream)value;

            var streamBytes = PdfReader.GetStreamBytes(stream);

            var tokenizer = new PRTokeniser(new RandomAccessFileOrArray(streamBytes));

            try
            {
                while (tokenizer.NextToken())
                {
                    if (tokenizer.TokenType == PRTokeniser.TK_STRING)
                    {
                        string str = tokenizer.StringValue;
                        sb.Append(str);
                    }
                }
            }
            finally
            {
                tokenizer.Close();
            }
        }
    }
}
finally
{
    reader.Close();
}

return sb.ToString();
