Parse PDF with iTextSharp and then extract specific text to the screen


Problem description

I am trying to extract certain content from a PDF file. The file is an invoice, and I want to search it for the text "Invoice Number:" and then "First Name", and print the extracted values with Console.WriteLine();

This is what I have so far, and I need to figure out how to proceed from here.

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfProperties
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/PDF/invoiceDetail.pdf");
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);

            // The using block flushes and closes the writer and its file on exit.
            using (StreamWriter sw = new StreamWriter("C:/PDF/result0.txt"))
            {
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    // Extract the text of page i (page numbers start at 1).
                    SimpleTextExtractionStrategy strategy =
                        parser.ProcessContent(i, new SimpleTextExtractionStrategy());
                    string text = strategy.GetResultantText();

                    sw.WriteLine(text);

                    // Split on '.'; the result is not used yet.
                    string[] splitText = text.Split(new char[] { '.' });

                    Console.WriteLine("Test");
                    Console.WriteLine(text);
                }
            }

            // Release the PDF file.
            reader.Close();
        }
    }
}

Any help is much appreciated.

Recommended answer

There are two ways to approach this:


  1. You can try to process the invoice yourself. That means handling structure and dealing with edge cases. What if the content isn't always aligned in the same way? What if the template of the invoice changes? What if some text in the invoice is variable and you can't really rely on the precise text being extracted? ..

This is, in short, not a trivial problem to solve.

  2. Use pdf2Data. It was designed specifically for documents that are rich in structure, such as invoices. It uses a concept called "selectors" that lets you define where you expect certain content to be, either by position (somewhere in the rectangle defined by coordinates ..) or by structural block (row .. from this table), etc.

Even though the add-on is closed source, you can always try it out with a trial license. After evaluating pdf2Data, you can at least make a more informed decision about which route you want to take to tackle this problem.

See itextpdf.com/itext7/pdf2Data for more information.
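
For the first route (processing the invoice yourself), here is a minimal sketch of pulling the two fields out of text that has already been extracted. It is not part of the original answer. It assumes that each value sits on the same line as its label in the string returned by GetResultantText(), which may not hold for every invoice layout. The names InvoiceFields and FindValue and the sample text are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class InvoiceFields
{
    // Returns the text that follows `label` on the same line,
    // or null when the label does not occur in the page text.
    // Assumption: label and value are on one line of the extracted text.
    public static string FindValue(string pageText, string label)
    {
        // Regex.Escape keeps the label literal; an optional ':' and
        // surrounding blanks are skipped before the value is captured.
        string pattern = Regex.Escape(label) + @"[ \t]*:?[ \t]*([^\r\n]*)";
        Match match = Regex.Match(pageText, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value.Trim() : null;
    }
}

class Demo
{
    static void Main()
    {
        // Stand-in for strategy.GetResultantText(); a real invoice will differ.
        string text = "ACME Ltd.\nInvoice Number: INV-0042\nFirst Name: Jane\nTotal: 10.00";

        Console.WriteLine(InvoiceFields.FindValue(text, "Invoice Number:"));
        Console.WriteLine(InvoiceFields.FindValue(text, "First Name"));
    }
}
```

In the question's loop, FindValue could be called on the text of each page until both labels are found. If the extraction puts labels and values on separate lines, or the template changes, this simple pattern will miss them, which is exactly the kind of edge case the first route has to deal with.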
