C# 使用 PdfSharp 从 PDF 中提取文本 [英] C# Extract text from PDF using PdfSharp
本文介绍了C# 使用 PdfSharp 从 PDF 中提取文本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
是否有可能使用 PdfSharp 从 PDF 文件中提取纯文本?我不想使用 iTextSharp 因为它的许可证.
Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.
推荐答案
参考了 Sergio 的回答,做了一些扩展方法.我也把字符串的累加改成了迭代器.
Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.
public static class PdfSharpExtensions
{
public static IEnumerable<string> ExtractText(this PdfPage page)
{
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
return text;
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
if (cObject is COperator)
{
var cOperator = cObject as COperator;
if (cOperator.OpCode.Name== OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
foreach (var txt in ExtractText(cOperand))
yield return txt;
}
}
else if (cObject is CSequence)
{
var cSequence = cObject as CSequence;
foreach (var element in cSequence)
foreach (var txt in ExtractText(element))
yield return txt;
}
else if (cObject is CString)
{
var cString = cObject as CString;
yield return cString.Value;
}
}
}
这篇关于C# 使用 PdfSharp 从 PDF 中提取文本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文