Convert a PDF table to Excel format using C#: this code extracts only the text, but I also want to keep the table layout. Please help.

Problem description

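// Assumes iTextSharp 5.x; callExcel is defined elsewhere in the asker's project.
// These using directives go at the top of the file.
using System;
using System.Collections;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

string[] words;

private void ExportPDFToExcel(string fileName)
{
    StringBuilder text = new StringBuilder();
    PdfReader pdfReader = new PdfReader(fileName);

    // Collect the text of every page, one line at a time.
    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
        words = currentText.Split('\n');
        for (int j = 0, len = words.Length; j < len; j++)
        {
            currentText = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
            text.Append(currentText + Environment.NewLine);
        }
    }
    // Close the reader only after all pages have been read, not inside the loop.
    pdfReader.Close();

    // FileMode.Create truncates an existing file; OpenOrCreate would leave stale bytes behind.
    FileStream fs1 = new FileStream("D:\\Yourfile.txt", FileMode.Create, FileAccess.Write);
    StreamWriter writer = new StreamWriter(fs1);
    writer.Write(text);
    writer.Close();

    // Read the file back line by line and hand the lines to callExcel.
    StreamReader objReader = new StreamReader("D:\\Yourfile.txt");
    string sLine = "";
    ArrayList arrText = new ArrayList();
    while (sLine != null)
    {
        sLine = objReader.ReadLine();
        if (sLine != null)
            arrText.Add(sLine);
    }
    objReader.Close();
    callExcel(arrText, false);
}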

private void button1_Click(object sender, EventArgs e)
       {
           string file = Path.GetFullPath(@"C:\Users\karthi\Desktop\ast_sci_data_tables_sample.pdf");
           this.ExportPDFToExcel(file);
       }

Recommended answer

Sorry, but PDF does not have a table format: it has no concept of cell, row, column, header, or footer.

Most of the tables you see are made of blocks of text that are "printed" in the right position to look like cells in a table. The best you can do is extract the text and metadata (font, position, ...) of your particular PDF and use heuristics to recreate the table structure. This is NOT a generic solution and has to be reviewed for every table you want to extract.
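To make the heuristic idea a bit more concrete, here is a minimal sketch. It assumes you have already collected each piece of text together with its position on the page; how you collect that is left open and depends on your PDF library. `TextChunk`, `TableHeuristic`, `BuildTable`, and both tolerance values are hypothetical names and numbers chosen for illustration, not part of iTextSharp. Chunks with a similar Y are treated as one row, and chunks with a similar X as one column.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container: one piece of text and where it sits on the page.
public sealed class TextChunk
{
    public string Text { get; }
    public float X { get; }   // left edge of the text
    public float Y { get; }   // baseline; in PDF coordinates Y grows upward

    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class TableHeuristic
{
    // Groups chunks into rows (similar Y) and columns (similar X).
    // The tolerances are guesses and must be tuned for each PDF layout.
    public static List<string[]> BuildTable(
        IEnumerable<TextChunk> chunks,
        float rowTolerance = 2f,
        float columnTolerance = 10f)
    {
        List<TextChunk> all = chunks.ToList();

        // Column anchors: walk the X positions left to right and start a
        // new column whenever X moves past the current anchor by more
        // than the tolerance.
        var columns = new List<float>();
        foreach (float x in all.Select(c => c.X).OrderBy(v => v))
        {
            if (columns.Count == 0 || x - columns[columns.Count - 1] > columnTolerance)
                columns.Add(x);
        }

        // Rows: walk from the top of the page down and start a new row
        // whenever Y differs from the row's first chunk by more than the tolerance.
        var table = new List<string[]>();
        float currentY = 0f;
        foreach (TextChunk chunk in all.OrderByDescending(c => c.Y).ThenBy(c => c.X))
        {
            if (table.Count == 0 || Math.Abs(currentY - chunk.Y) > rowTolerance)
            {
                table.Add(Enumerable.Repeat(string.Empty, columns.Count).ToArray());
                currentY = chunk.Y;
            }

            string[] row = table[table.Count - 1];

            // The chunk belongs to the last column anchor at or left of its X.
            int col = columns.FindLastIndex(anchor => anchor <= chunk.X);

            // Several chunks in one cell are joined with a space.
            row[col] = row[col].Length == 0 ? chunk.Text : row[col] + " " + chunk.Text;
        }

        return table;
    }
}
```

Each `string[]` in the result is one row with one entry per detected column, so you can write it to a worksheet cell by cell or join it with tabs or commas. Tables with right-aligned numbers, merged cells, or text that wraps inside a cell will break this simple version, which is exactly why the result has to be checked for every table.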

