Read specific value based on label name from PDF in C#


Problem description

I have an ASP.NET Core 2.0 C# application that reads and parses a PDF file and extracts its text. From that text I want to read a specific value identified by its label name. In the image below, I want to get the value 171857, which is the invoice number, and store it in a database.

I tried the code below to read the PDF using iTextSharp.

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

using (PdfReader reader = new PdfReader(fileName))
{
    StringBuilder sb = new StringBuilder();

    // PDF page numbers are 1-based.
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Use a fresh strategy per page; a reused SimpleTextExtractionStrategy keeps the text of earlier pages.
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        if (!string.IsNullOrWhiteSpace(text))
        {
            sb.Append(Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
        }
    }

    var pdfText = sb.ToString();
}

The pdfText variable ends up holding all the text content of the PDF, but this does not seem to be the proper way to get the invoice number. Is there another way to read specific content from a PDF by its label name, for example by providing the label name Invoice and getting back the value 171857, perhaps with another third-party PDF reader library?

Any help or suggestions would be highly appreciated.

Thanks

Recommended answer

I have helped a friend extract a similar value from a PDF invoice generated by Excel. For this answer I created an Excel invoice, printed it to a PDF file, and zipped it for download so that you can test with it.

Next, I use a free, open-source library called PDFClown. Here is the NuGet package for it.

What I do is scan the whole PDF document (an invoice can be one page or several) and add each piece of text content to a list of strings.

Next, I find the index of the invoice value by locating the element in front of it, which I will call its tag or label. The invoice number could just as well be the 10th element in the list; in our case it is at index 1.

Since I do not have your PDF file, I improvised and added a unique tag called "INVOICE" (any other name would do). In this case the invoice number comes right after the INVOICE tag, so I find the index of the "INVOICE" tag and add 1 to it. That picks the invoice text 0005, which is returned as the value 5. In the same way you can fetch whatever text or value follows any tag in the scanned list and return it in the form you need.

So you will need to play with it a bit to make it fit your PDF file exactly.

Here are my test files (Excel and PDF), zipped. Download them for your own testing.

Here is the code:

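using System;
using System.Collections.Generic;
// PDFClown namespaces that provide File, ContentScanner, ShowText, Text and ContainerObject.
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
using org.pdfclown.files;

public class InvoiceTextExtraction
{
    private List<string> _contentList;

    public void GetValueFromPdf()
    {
        _contentList = new List<string>();
        CreatePdfContent(@"C:\temp\Invoice1.pdf");

        // The invoice number is the element right after the "INVOICE" label.
        var labelIndex = _contentList.FindIndex(e => e == "INVOICE");

        // FindIndex returns -1 when the label is missing; the value must also exist after it.
        if (labelIndex < 0 || labelIndex + 1 >= _contentList.Count)
        {
            Console.WriteLine("INVOICE label or its value was not found.");
            return;
        }

        if (int.TryParse(_contentList[labelIndex + 1], out var value))
            Console.WriteLine(value);
    }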


    public void CreatePdfContent(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;

            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                {
                    var font = level.State.Font;
                    _contentList.Add(font.Decode(text.Text));
                    break;
                }
                case Text _:
                case ContainerObject _:
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}

This is the text extracted from the PDF file. The scan returns the following elements:

INVOICE
0005

PAYMENT DUE BY:
4/19/2019
.etc
.
.
.
Tax
USD TOTAL
171857
18 september 2019

Here is the result:

5
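
The lookup described in this answer (find the label, then take the element that follows it) can be generalised to any label. The following is only a minimal sketch under stated assumptions, not part of the original answer: the class name LabelLookup and the method GetValueAfterLabel are hypothetical, and it assumes the value always sits in the list element directly after its label. It works on the plain list of strings, so it does not depend on PDFClown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of PDFClown or of the original answer.
public static class LabelLookup
{
    // Returns the element that directly follows the first occurrence of `label`,
    // or null when the label is missing or is the last element in the list.
    // Assumption: the value always sits in the element right after its label.
    public static string GetValueAfterLabel(IReadOnlyList<string> items, string label)
    {
        // Stop one element early so that items[i + 1] always exists.
        for (var i = 0; i < items.Count - 1; i++)
        {
            // Trim and ignore case, since extracted text may carry stray spaces.
            if (string.Equals(items[i]?.Trim(), label, StringComparison.OrdinalIgnoreCase))
                return items[i + 1]?.Trim();
        }

        return null;
    }
}
```

With the elements listed above, calling GetValueAfterLabel with "INVOICE" returns "0005" (which int.TryParse turns into 5), and calling it with "PAYMENT DUE BY:" returns "4/19/2019". Returning null instead of throwing leaves it to the caller to decide what a missing label means.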

The code was inspired by this link.
