How to convert a PDF file to Excel in C#


Question

I want to extract some data, such as email addresses, from tables in a PDF file, and then use the extracted addresses to send email to those people.

What I have found so far by searching the web:


  1. I have to convert the PDF file to Excel so that I can read the data easily and use it as I want.

I found some free DLLs, such as iTextSharp and PDFsharp.

But I didn't find any code snippet that helps do this in C#. Is there a solution?

Recommended answer

You absolutely do not have to convert the PDF to Excel. First, determine whether your PDF contains textual data or is a scanned image. If it contains textual data, then you are right about using "some free dll". I recommend iTextSharp, as it is popular and easy to use.
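One rough way to make the text-versus-scan decision is to look at how much real text extraction gives back. The sketch below assumes you have already collected one string per page (for example with iTextSharp's PdfTextExtractor.GetTextFromPage); the helper name HasExtractableText and the minChars threshold are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PdfTextCheck
{
    // Hypothetical helper: returns true once the pages together contain at
    // least minChars letters or digits. A scanned (image-only) PDF usually
    // yields empty or whitespace-only strings and so returns false.
    public static bool HasExtractableText(IEnumerable<string> pageTexts, int minChars = 20)
    {
        if (pageTexts == null) throw new ArgumentNullException(nameof(pageTexts));

        int count = 0;
        foreach (string text in pageTexts)
        {
            if (string.IsNullOrEmpty(text)) continue;
            count += text.Count(c => char.IsLetterOrDigit(c));
            if (count >= minChars) return true; // enough text found, stop early
        }
        return false;
    }
}
```

If this returns false, the file is probably a scan and would need OCR rather than text extraction. The threshold is arbitrary and may need tuning for very short documents.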

Now the controversial part. If you don't need a rock-solid solution, the easiest approach is to read the whole PDF into a string and then retrieve the emails using a regular expression.
Here is an example (not perfect) of reading a PDF with iTextSharp and extracting emails:

using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class PdfEmailExtractor
{
    // Adjust the expression to match the labels around the address in your PDF.
    private static readonly Regex emailRegex = new Regex("Email Address (?<email>.+?) Passport No");

    // Concatenates the extracted text of every page into one string.
    public string PdfToString(string fileName)
    {
        var sb = new StringBuilder();
        var reader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                var strategy = new SimpleTextExtractionStrategy();
                // GetTextFromPage already returns a .NET string, so no re-encoding is needed.
                sb.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close(); // release the file even if extraction throws
        }
        return sb.ToString();
    }

    // Yields the text captured between the two labels for each match.
    public IEnumerable<string> ExtractEmails(string content)
    {
        foreach (Match m in emailRegex.Matches(content))
        {
            yield return m.Groups["email"].Value;
        }
    }
}
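
The expression in the answer depends on the labels "Email Address" and "Passport No" appearing around each address. If your PDF has no such labels, a label-independent pattern may work instead. The sketch below is an assumption and not part of the original answer: the class name GenericEmailExtractor and the pattern are illustrative, and the pattern is deliberately loose rather than a full implementation of email address syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GenericEmailExtractor
{
    // Loose pattern: a local part, '@', then at least two dot-separated
    // domain labels. It will miss some valid addresses and accept some
    // invalid ones, so adjust it to your data.
    private static readonly Regex EmailPattern = new Regex(
        @"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
        RegexOptions.Compiled);

    // Returns each address once, comparing case-insensitively and keeping
    // the first spelling encountered.
    public static IReadOnlyList<string> ExtractDistinctEmails(string content)
    {
        if (string.IsNullOrEmpty(content)) return new List<string>();

        return EmailPattern.Matches(content)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

You would pass the string returned by PdfToString to ExtractDistinctEmails. Before sending mail to the results, it may be worth validating each one, for instance by constructing a System.Net.Mail.MailAddress, which throws a FormatException for input it cannot parse.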
