Using RegEx in vb.net


Question

Here is what I need to do (for clarity): take a PDF file (link at the bottom), then parse only the information under each header into a DataGridView. I couldn't think of a way to do this directly (there is no native way to handle PDFs), so my only thought was to convert the PDF to a text document and then somehow load that text into the DataGridView.

So, using iTextSharp, I first convert the PDF to a text file, which keeps "most" of its formatting (see the link below).

Here is the source code for that:

Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
Dim mPageCount As Integer = mPDFreader.NumberOfPages()
Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
'Create the text file.
Dim fs As FileStream = File.Create(mTXT)
Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
For i As Integer = 1 To mPageCount
    strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
    Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
    fs.Write(info, 0, info.Length)
Next
fs.Close()

However, I only need the "lines" of information, so everything should look like this:

63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

To do that, I needed to use RegEx to remove everything I didn't want. Here is the RegEx I used:

(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)

Here is the code I used:

Private Sub Fixtext()

        Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
        Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
            While (True)
                Dim line As String = reader.ReadLine()
                If line = Nothing Then
                    Return
                End If
                Dim match As Match = regex.Match(line)
                If match.Success Then
                    Dim value As String = match.Groups(1).Value
                    Console.WriteLine(line)
                End If
            End While
        End Using
End Sub

The results are "close" but not exactly what I need. In some cases records are "crammed" together, and some unwanted parts are still left behind. For example:

90 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS
491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS
Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 
493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS

The format I actually need is (again) one I can later import into a DataGridView, so each line needs to be:

[number][ID][ID2][Date][Notes] 
[number][ID][ID2][Date][Notes]
[number][ID][ID2][Date][Notes] 
[number][ID][ID2][Date][Notes] 

Using this "concept", here is an example of what I need (I know this doesn't work, but I'm after something along these lines that does):

  Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
            Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
                While (True)
                    Dim line As String = reader.ReadLine()
                    If line = Nothing Then
                        Return
                    End If
                    Dim match As Match = regex.Match(line)
                    If match.Success Then
                        Dim value As String = match.Groups(1).Value
                        Dim s As String = value
                        s = s.Replace(" Tracking Id Forms Required Order Id RTS done on Notes", Nothing)
                        s = s.Replace("EXPRESS ", "EXPRESS")
                        s = s.Replace("EXPRESS", "EXPRESS" & vbCrLf)
                        Console.WriteLine(line)
                    End If
                End While
            End Using

Here is a "brief" explanation with the files included.

Copy of the original PDF (this is the PDF being converted to .txt using iText). I am only doing this because I can't think of another way, short of paying for a third-party tool to convert a PDF to XLS:

https://drive.google.com/file/d/1iHMM_G4UBUlKaa44-Wb00F_9ZdG-vYpM/view?usp=sharing

Using the "iText method" mentioned above, this is the converted output file:

https://drive.google.com/file/d/10dgJDFW5XlhsB0_0QAWQvtimsDoMllx-/view?usp=sharing

I then use the RegEx mentioned above to strip out what I don't need. However, it isn't working.

So my questions are (for "clarity"):

  1. Is this the only or best method to do what I need? (Convert the PDF to text, remove what I don't need, then load that information into a DataGridView.) Or is there another, cleaner, better method?

  2. (If not 1) How can I make this work? Is something wrong with my RegEx or my logic? Am I missing something better or cleaner that someone can help me see?

  3. (If 2, not 1) What is the best way to take the results and place them in the proper DataGridView columns?

Final statement: It doesn't have to be this method. I will take "ANY" method that lets me do what I need, the cleaner the better. However, I have to avoid third-party libraries that are free with limitations, as well as paid third-party libraries. That leaves me with limited options (i.e. PDFBox, iText, iTextSharp). The method has to get me from a PDF (like the sample above) to that table information in a DataGridView or even a ListView.

I will take any help, and I am more than appreciative. I re-asked this question because a mod closed my original one, stating that it wasn't clear what I needed. In both cases I tried to make the question as thorough as possible, and I hope this version is clearer so it doesn't get closed abruptly.

Recommended answer

I cheated a bit by correcting the text file. The extracted text goes a little wonky at page breaks and fails to start a new line there. Perhaps you can correct that with iTextSharp or with the hard-to-maintain regex.
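If correcting the file by hand is not an option, one possible way to repair the missing line breaks is to run a pattern over the whole extracted text instead of line by line, so that records fused onto a header or onto each other are still found. The following is only a sketch under assumptions taken from the sample rows: the first ID is four letters followed by digits, the second starts with "OD", and the notes are either "EXPRESS" or "EXPRESS, Replacement order". The module and function names are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module RecordSplitter
    ' Assumed record shape, inferred from the sample rows:
    ' row number, 4 letters + digits, "OD" + digits, date, then the notes.
    Private ReadOnly RecordPattern As New Regex(
        "(?<num>\d+)\s+(?<id>[A-Z]{4}\d+)\s+(?<id2>OD\d+)\s+" &
        "(?<date>[A-Z][a-z]{2}\s+\d{1,2},\s+\d{4}\s+\d{1,2}:\d{2}\s+[AP]M)\s+" &
        "(?<notes>EXPRESS(?:,\s*Replacement order)?)")

    ' Scans the whole text and returns one normalized line per record,
    ' regardless of where the original line breaks fell.
    Function SplitRecords(text As String) As List(Of String)
        Dim rows As New List(Of String)
        For Each m As Match In RecordPattern.Matches(text)
            ' Collapse any run of whitespace inside the date to a single space.
            Dim dateText As String = Regex.Replace(m.Groups("date").Value, "\s+", " ")
            rows.Add(String.Join(" ", m.Groups("num").Value, m.Groups("id").Value,
                                 m.Groups("id2").Value, dateText, m.Groups("notes").Value))
        Next
        Return rows
    End Function

    Sub Main()
        ' The crammed lines from the question, joined the way they appear in the file.
        Dim sample As String =
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS " & vbLf &
            "493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS"
        For Each row As String In SplitRecords(sample)
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

Passing `File.ReadAllText(...)` to `SplitRecords` and writing the result back with `File.WriteAllLines(...)` would give a file with one record per line. The named groups replace the fixed-width `.{14}` and `.{20}` counts, which is the part of the original pattern that is hardest to maintain.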

I made a class to hold the data. The property names become the column headers in the DataGridView.

I read all the lines in the text file into an array. For each line, I checked whether the first character was a digit, then split the line on spaces into another array. Next, I created a new Tracking object, setting all of its properties through the parameterized constructor.

Finally, I checked whether the line contained a comma and added that bit of text to the Notes property. The completed object is added to the list.

After the loop, lst is bound to the grid.

' Requires Imports System.IO and Imports System.Globalization at the top of the file.
Public Class Tracking
    Public Property Number As Integer
    Public Property ID As String
    Public Property ID2 As String
    Public Property TrackDate As Date
    Public Property Notes As String
    Public Sub New(TNumber As Integer, TID As String, TID2 As String, TDate As DateTime, TNotes As String)
        Number = TNumber
        ID = TID
        ID2 = TID2
        TrackDate = TDate
        Notes = TNotes
    End Sub
End Class

Private Sub OPCode()
    Dim lst As New List(Of Tracking)
    Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
    For Each line In lines
        ' Skip blank lines (line(0) would throw) and header lines.
        If line.Length > 0 AndAlso Char.IsDigit(line(0)) Then
            Dim parts = line.Split(" "c)
            ' parts(3)..parts(7) hold the date, e.g. "Aug 28, 2020 02:18 PM".
            ' InvariantCulture is used because the month names are English.
            Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.InvariantCulture), parts(8))
            ' The date always contains a comma, so test the notes field itself:
            ' "EXPRESS," means more note text follows.
            If parts(8).EndsWith(",") Then
                T.Notes &= " " & String.Join(" ", parts, 9, parts.Length - 9)
            End If
            lst.Add(T)
        End If
    Next
    DataGridView1.DataSource = lst
End Sub

EDIT

To pinpoint the error, let's try...

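Private Sub OPCode()
    Dim lst As New List(Of Tracking)
    Dim lines = File.ReadAllLines("C:\Users\maryo\Desktop\test.txt")
    For Each line In lines
        ' Skip blank lines (line(0) would throw) and header lines.
        If line.Length > 0 AndAlso Char.IsDigit(line(0)) Then
            Dim parts = line.Split(" "c)
            ' Stop at the first line that is missing fields and show it in the Output window.
            If parts.Length < 9 Then
                Debug.Print(line)
                MessageBox.Show($"We have a line that does not include all fields.")
                Exit Sub
            End If
            ' InvariantCulture is used because the month names are English.
            Dim T As New Tracking(CInt(parts(0)), parts(1), parts(2), Date.ParseExact($"{parts(3)} {parts(4)} {parts(5)} {parts(6)} {parts(7)}", "MMM d, yyyy hh:mm tt", CultureInfo.InvariantCulture), parts(8))
            ' The date always contains a comma, so test the notes field itself:
            ' "EXPRESS," means more note text follows.
            If parts(8).EndsWith(",") Then
                T.Notes &= " " & String.Join(" ", parts, 9, parts.Length - 9)
            End If
            lst.Add(T)
        End If
    Next
    DataGridView1.DataSource = lst
End Sub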
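If the file has more than one bad line, stopping at the first one means repeating the run after every fix. As a rough sketch of an alternative, the check can be pulled into a function that describes what is wrong with a line, so that every problem line is listed in one pass. The names `LineChecker` and `DescribeProblem` are hypothetical, and the "OD" prefix and the "notes ending in a digit" test are assumptions based on the sample rows in the question.

```vbnet
Imports System
Imports System.Globalization

Module LineChecker
    ' Returns Nothing for lines that look fine (or are plain headers),
    ' otherwise a short description of what is wrong with the line.
    Function DescribeProblem(line As String) As String
        If String.IsNullOrWhiteSpace(line) Then Return Nothing
        If Not Char.IsDigit(line(0)) Then
            ' Assumption: order IDs start with "OD", so a header line
            ' containing one has a record fused onto it.
            Return If(line.Contains(" OD"), "record fused onto header text", Nothing)
        End If
        Dim parts As String() = line.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
        If parts.Length < 9 Then Return $"only {parts.Length} fields"
        ' Fields 3..7 should form a date such as "Aug 28, 2020 03:21 PM".
        Dim parsed As Date
        If Not Date.TryParseExact(String.Join(" ", parts, 3, 5), "MMM d, yyyy hh:mm tt",
                                  CultureInfo.InvariantCulture, DateTimeStyles.None, parsed) Then
            Return "date does not parse"
        End If
        ' Assumption: a notes field ending in a digit (e.g. "EXPRESS494")
        ' is the start of the next record run together with this one.
        If Char.IsDigit(parts(8)(parts(8).Length - 1)) Then Return "two records run together"
        Return Nothing
    End Function

    Sub Main()
        ' Sample lines taken from the question's output.
        Dim lines As String() = {
            "90 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS",
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS ",
            "493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS"
        }
        For i As Integer = 0 To lines.Length - 1
            Dim problem As String = DescribeProblem(lines(i))
            If problem IsNot Nothing Then
                Console.WriteLine($"Line {i + 1}: {problem}")
            End If
        Next
    End Sub
End Module
```

In the form code, the same function could be called inside the loop, with the messages collected into a list and shown once at the end instead of exiting on the first failure.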
