How to read a PDF file line by line with its blank spaces preserved (as is) in C#/.NET using iTextSharp


Question

I am using iText (for .NET) to read PDF files. It reads the document, but wherever the text contains a run of whitespace, it returns only a single space.

That makes it impossible to extract data by taking substrings. I want to read the data line by line with the whitespace intact, so that I know the actual position of each piece of text, because I want to write the data into a database.

The file is a bank statement. I want to load it into a database in order to build a reconciliation system.

Here is a screenshot of the file.

This is the code I am using:

            For page As Integer = 1 To pdfReader.NumberOfPages
                ' LocationTextExtractionStrategy orders the text chunks by their position on the page.
                Dim strategy As ITextExtractionStrategy = New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy()
                ' GetTextFromPage returns an ordinary .NET string, so no re-encoding step is needed.
                Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

                ' Split the page text into lines.
                Dim delimiterChars As Char() = {ControlChars.Lf}
                Dim lines As String() = currentText.Split(delimiterChars)

                ' Flags that track which field is expected next.
                Dim Bnk_Name As Boolean = True
                Dim Br_Name As Boolean = False
                Dim Name_acc As Boolean = False
                Dim statment As Boolean = False
                Dim Curr As Boolean = False
                Dim Open As Boolean = False
                Dim BankName = ""
                Dim Branch = ""
                Dim AccountNo = ""
                Dim CompName = ""
                Dim Currency = ""
                Dim Statement_from = ""
                Dim Statement_to = ""
                Dim Opening_Balance = ""
                Dim Closing_Balance = ""
                Dim Narration As String = ""
                For Each line As String In lines

                    'BANK NAME
                    If Bnk_Name Then
                        If line.Trim() <> "" Then
                            ' Take at most the first 21 characters; Substring throws on shorter lines.
                            BankName = line.Substring(0, Math.Min(21, line.Length))
                        End If
                        Bnk_Name = False
                    End If

                    ' ...
                Next
            Next

But I want the whitespace kept exactly as it is, so that I can read the text by position.

Recommended answer

(Without seeing your PDF, this explanation is the best I can come up with.)

Your document does not contain any spaces. That is to say, the content streams of your document contain no space characters. Instead, the instructions that render the characters position them so that the gaps appear where the spaces should be.

In that case, iText has to "guess" where the spaces are. It estimates this by inserting a single space whenever two characters are further apart than a threshold based on the width of the space character in the font being used. A wide gap therefore still yields only one space.
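To make the effect concrete, here is a minimal sketch of such a gap-based heuristic. It is a simplified illustration and not iText's actual code: the `Chunk` structure, the `JoinChunks` function, and the coordinates are all hypothetical, and the threshold is assumed to be exactly one space width.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module SpaceGuessSketch
    ' Hypothetical stand-in for a positioned piece of text; not an iText type.
    Structure Chunk
        Public Text As String
        Public StartX As Single
        Public EndX As Single
        Public Sub New(t As String, x0 As Single, x1 As Single)
            Text = t
            StartX = x0
            EndX = x1
        End Sub
    End Structure

    ' Joins chunks from one line, left to right. A gap wider than spaceWidth
    ' produces exactly one space, no matter how wide the gap is.
    Function JoinChunks(chunks As IEnumerable(Of Chunk), spaceWidth As Single) As String
        Dim sb As New StringBuilder()
        Dim isFirst As Boolean = True
        Dim prevEnd As Single = 0
        For Each c As Chunk In chunks
            If Not isFirst AndAlso c.StartX - prevEnd > spaceWidth Then
                sb.Append(" "c)
            End If
            sb.Append(c.Text)
            prevEnd = c.EndX
            isFirst = False
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' Two column headers 160 units apart, with an assumed space width of 3 units.
        Dim line As New List(Of Chunk) From {
            New Chunk("DATE", 10, 40),
            New Chunk("AMOUNT", 200, 250)
        }
        Console.WriteLine(JoinChunks(line, 3.0F)) ' prints: DATE AMOUNT
    End Sub
End Module
```

The two headers sit far apart on the page, yet the output contains a single space, which is why substring offsets in the extracted text do not match the visual columns.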

That is possibly where things go wrong here.

Equally important, however: you should never rely on text positions to extract data. That approach is simply too error-prone.

Try using regular expressions combined with a better ITextExtractionStrategy. There is an implementation of ITextExtractionStrategy that lets you specify a Rectangle, so that only the text inside that region is extracted. That way you can get content from your document much more precisely.
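As a sketch of the rectangle-based approach, assuming iTextSharp 5.x: a `RegionTextRenderFilter` can be wrapped in a `FilteredTextRenderListener` around a `LocationTextExtractionStrategy`. The module name, the `ReadRegion` helper, and its parameters are made up for illustration; the region coordinates must come from your own statement layout (PDF user-space points, origin at the bottom-left of the page).

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module RegionExtractSketch
    ' Hypothetical helper: returns only the text whose position falls inside
    ' the given region of the given page.
    Function ReadRegion(path As String, pageNumber As Integer,
                        region As iTextSharp.text.Rectangle) As String
        Dim reader As New PdfReader(path)
        Try
            ' The filter lets through only text rendered inside the region.
            Dim regionFilter As New RegionTextRenderFilter(region)
            Dim strategy As ITextExtractionStrategy =
                New FilteredTextRenderListener(New LocationTextExtractionStrategy(), regionFilter)
            Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
        Finally
            reader.Close()
        End Try
    End Function
End Module
```

A call might look like `ReadRegion("statement.pdf", 1, New iTextSharp.text.Rectangle(30, 700, 300, 760))`, where both the file name and the coordinates are placeholders. Reading one region per field avoids depending on how many spaces the extractor inserts between columns.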

Since you are dealing with bank statements, it should be easy to extract the content by combining rectangle-based search with regular expressions (for example, looking for text that matches the format of a bank account number).
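A minimal sketch of the regular-expression side is shown below. The "Account No" label, the 8 to 16 digit length, and the sample line are assumptions made for illustration; the pattern has to be adapted to the wording your bank actually uses.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StatementRegexSketch
    ' Assumed format: the label "Account No", optionally followed by "." or ":",
    ' then a run of 8 to 16 digits. \s* tolerates any amount of whitespace.
    Private ReadOnly AccountPattern As New Regex(
        "Account\s*No\.?\s*:?\s*(\d{8,16})", RegexOptions.IgnoreCase)

    ' Returns the account number found in the text, or Nothing if there is none.
    Function FindAccountNumber(text As String) As String
        Dim m As Match = AccountPattern.Match(text)
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        ' Hypothetical extracted line; the spacing does not affect the result.
        Dim sample As String = "Account No :   0012345678901  Currency USD"
        Console.WriteLine(FindAccountNumber(sample)) ' prints: 0012345678901
    End Sub
End Module
```

Because the pattern matches on content rather than on column offsets, it keeps working whether the extractor emits one space or several between fields.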
