How to read a table from a PDF using iTextSharp?
Problem description
I am having a problem reading a table from a PDF file. It's a very simple PDF file with some text and a table. The tool I am using is iTextSharp. I know there is no table concept in PDF. After some googling, I found someone saying it might be possible to achieve this using iTextSharp plus a custom ITextExtractionStrategy, but I have no idea how to start. Can someone please give me some hints, or a small piece of sample code?
Cheers
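Recommended answer
This code reads the contents of a table. In the page's content stream, every value is enclosed in ( ) followed by the Tj operator, so we look for the lines that contain Tj and strip the wrapper; you can then do whatever you need with the resulting strings. Note that this only works when the values are written as literal strings with one Tj operator per line.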
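using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;

// Note: PdfReader does not expand "~"; point this at a real file path.
string _filePath = @"~\MyPDF.pdf";

// Returns one string per page, built from the operands of that page's Tj operators.
public List<String> Read()
{
    var pdfReader = new PdfReader(_filePath);
    var pages = new List<String>();
    try
    {
        // PDF page numbers are 1-based.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            // GetPageContent returns the page's content stream as bytes.
            string textFromPage = Encoding.Default.GetString(pdfReader.GetPageContent(i));
            pages.Add(GetDataConvertedData(textFromPage));
        }
    }
    finally
    {
        pdfReader.Close();
    }
    return pages;
}

// Keeps only the lines that contain "Tj" and strips the "(" ... ")Tj" wrapper.
string GetDataConvertedData(string textFromPage)
{
    var texts = textFromPage.Split(new[] { "\n" }, StringSplitOptions.None)
        .Where(text => text.Contains("Tj")).ToList();
    return texts.Aggregate(string.Empty, (current, t) => current +
        t.Trim()              // drop "\r" and indentation
         .TrimStart('(')
         .TrimEnd('j')
         .TrimEnd('T')
         .TrimEnd()           // allow whitespace between ")" and "Tj"
         .TrimEnd(')'));
}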
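The trimming approach above joins all values into one string and assumes one Tj operator per line. As a rough alternative, here is a minimal sketch that pulls each ( ) Tj operand out with a regular expression and returns the values as a list. The class TjOperandExtractor and its methods are hypothetical names introduced for illustration; they are not part of iTextSharp. The sketch assumes the content stream has already been decoded to a string (for example from GetPageContent, as above), and it does not handle TJ arrays, hex strings, nested unescaped parentheses, or font-specific encodings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not an iTextSharp type): extracts the literal-string
// operands of Tj operators from a decoded page content stream.
public static class TjOperandExtractor
{
    // Matches "( ... )" followed by optional whitespace and the Tj operator.
    // Backslash escapes such as \( and \) are allowed inside the string.
    // The match is case-sensitive, so the TJ (array) operator is ignored.
    private static readonly Regex TjPattern = new Regex(
        @"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj\b",
        RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns the operands in the order they appear in the stream.
    public static List<string> Extract(string contentStream)
    {
        var values = new List<string>();
        foreach (Match match in TjPattern.Matches(contentStream))
        {
            values.Add(Unescape(match.Groups["text"].Value));
        }
        return values;
    }

    // Resolves only the escapes \( \) and \\ ; any other escape sequence
    // (for example \n or octal codes) is left in the output unchanged.
    private static string Unescape(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        for (int i = 0; i < raw.Length; i++)
        {
            bool escapesDelimiter = raw[i] == '\\'
                && i + 1 < raw.Length
                && "()\\".IndexOf(raw[i + 1]) >= 0;
            if (escapesDelimiter)
            {
                i++; // skip the backslash, keep the escaped character
            }
            sb.Append(raw[i]);
        }
        return sb.ToString();
    }
}
```

For example, calling TjOperandExtractor.Extract on the string "BT /F1 12 Tf (Name) Tj (Age \(years\)) Tj ET" should give a list with the two values "Name" and "Age (years)". Because each value stays separate, grouping them into rows and columns is left to the caller; the content stream itself carries no table structure.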