Split multi-page PDFs based on barcode on page till the next unique barcode

Question

So far I have VB.NET code that works for one file. It splits that file based on a unique barcode that is on each page to identify it.

Each barcode is one of:

COVERSPLIT
COMPLAINTSPLIT
EXHIBITSPLIT
MILSPLIT
SUMSPLIT

The problem is this: say the first page has the COVERSPLIT barcode because it is a cover sheet, and the next sheet is also a cover sheet but does not have the barcode on it. When I run my code, it extracts only the sheets with an identified barcode and leaves out the ones without.

Here is what I tried:

Imports Bytescout.PDFExtractor
Imports System.Collections
Imports System.Collections.Generic
Imports System.IO.Path
Class Program


    Friend Shared Sub Main(args As String())



        Dim Dir As String = "G:\Word\Department Folders\Pre-Suit\Drafts-IL\2-IL_AttyReview\2018-09\Reviewed\"
        Dim inputFile As String = Dir & "ZTEST01.SMITH.pdf"
        Dim Unmerged As String = Dir & "unmerged\"

        Dim Path As String = IO.Path.GetFileNameWithoutExtension(inputFile)
        Dim Extracted As String = Path.Substring(0, 7)

        ' Create Bytescout.PDFExtractor.TextExtractor instance
        Dim extractor As New TextExtractor()

        ' Load sample PDF document
        extractor.LoadDocumentFromFile(inputFile)

        Dim pageCount As Integer = extractor.GetPageCount()

        ' Search each page for a keyword 
        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "COVERSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 1
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Cover Sheet " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)


                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next

        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "COVERSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 2
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Cover Sheet " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)


                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next
        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "COMPLAINTSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 1
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Complaint " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next

        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "COMPLAINTSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 2
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Complaint " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next
        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "EXHIBITSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 1
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Exhibit " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next

        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "EXHIBITSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 2
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Exhibit " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next
        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "MILSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 1
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Military " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next

        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "SUMSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 1
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Summons " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next

        For i As Integer = 0 To pageCount - 1

            If extractor.Find(i, "SUMSPLIT", False) Then

                ' Extract page
                Using splitter As New DocumentSplitter()

                    splitter.OptimizeSplittedDocuments = True

                    Dim pageNumber As Integer = i + 2
                    ' (!) page number in ExtractPage() is 1-based

                    Dim outputfile As String = Unmerged & Extracted & " Summons " & pageNumber.ToString() & ".pdf"

                    splitter.ExtractPage(inputFile, outputfile, pageNumber)

                    Console.WriteLine("Extracted page " & pageNumber.ToString() & " to file """ & outputfile & """")

                End Using
            End If
        Next

        ' Cleanup
        extractor.Dispose()

        Console.WriteLine()
        Console.WriteLine("Press any key...")
        Console.ReadKey()

    End Sub
End Class

As you can see, I just copied and pasted the same For i... loop and changed Dim pageNumber As Integer = i + 1 to i + 2 to include its secondary page.

The problem with that is that the page with the unique barcode can sometimes be followed by an indeterminate number of pages.

So, how would I write this so that it extracts, for example:

Page COVERSPLIT plus all the subsequent pages without a barcode, until it gets to the next page with a barcode (COMPLAINTSPLIT, for example)? And also, how could I do this so that it extracts the page with barcode COVERSPLIT together with its following pages (until it reaches the next barcode), keeping all those pages in one PDF?

Recommended answer

You have already noticed that you have a lot of repeated code. What you can do in that case is put the small part which varies between the otherwise-identical code into a variable.

So, if we get a list of the barcodes which identify the type of a page we can iterate over them to find out what type the current page is. If there is no barcode then we assume the page type is unchanged from the previous page.

Option Infer On
Option Strict On

Imports System.IO
Imports Bytescout.PDFExtractor

Module Module1

    Class PageType
        Property Identifier As String
        Property TypeName As String
    End Class

    Sub Main()
        Dim dir = "G:\Word\Department Folders\Pre-Suit\Drafts-IL\2-IL_AttyReview\2018-09\Reviewed\"

        Dim inputFile = Path.Combine(dir, "ZTEST01.SMITH.pdf")
        Dim unmerged = Path.Combine(dir, "unmerged")

        ' Set up a list of the identifiers to be searched for and the corresponding names to be used in the filename.
        Dim pageTypes As New List(Of PageType)
        Dim ids = {"COVERSPLIT", "COMPLAINTSPLIT", "EXHIBITSPLIT", "MILSPLIT", "SUMSPLIT"}
        Dim nams = {" Cover Sheet ", " Complaint ", " Exhibit ", " Military ", " Summons "}
        For i = 0 To ids.Length - 1
            pageTypes.Add(New PageType With {.Identifier = ids(i), .TypeName = nams(i)})
        Next

        Dim extracted = Path.GetFileNameWithoutExtension(inputFile).Substring(0, 7)

        Dim extractor As New TextExtractor()

        ' Load sample PDF document
        extractor.LoadDocumentFromFile(inputFile)

        Dim pageCount = extractor.GetPageCount()
        Dim currentPageTypeName = "UNKNOWN"

        ' Search each page for a keyword 
        For i = 0 To pageCount - 1

            ' Find the type of the current page
            ' If it is not present on the page, then the last one found will be used.
            For Each pt In pageTypes
                If extractor.Find(i, pt.Identifier, False) Then
                    currentPageTypeName = pt.TypeName
                End If
            Next

            ' Extract page
            Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True}
                Dim pageNumber = i + 1  ' (!) page number in ExtractPage() is 1-based
                Dim outputfile = Path.Combine(unmerged, extracted & currentPageTypeName & pageNumber & ".pdf")

                splitter.ExtractPage(inputFile, outputfile, pageNumber)

                Console.WriteLine("Extracted page " & pageNumber & " to file """ & outputfile & """")

            End Using

        Next

        extractor.Dispose()

        Console.WriteLine()
        Console.WriteLine("Press any key...")
        Console.ReadKey()

    End Sub

End Module

I suspect that the Using splitter As New DocumentSplitter() With {.OptimizeSplittedDocuments = True} should be outside the For loop so that it is not created and destroyed for every page.

I renamed your Path variable as it interfered with the concise use of IO.Path. It's better to use the Path.Combine method to combine parts of a path because it takes care of the path separator characters for you.
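
As a small illustration of the Path.Combine point, here is a minimal sketch. The folder names are hypothetical and not taken from the question; the sketch only shows that the result is the same whether or not the first part ends with a separator.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports System.IO

Module PathCombineSketch

    Sub Main()
        ' Hypothetical folder names, used only for illustration.
        Dim baseDir = "Reviewed"
        Dim baseDirWithSeparator = baseDir & Path.DirectorySeparatorChar

        ' Path.Combine inserts a separator only when one is missing.
        Dim a = Path.Combine(baseDir, "unmerged")
        Dim b = Path.Combine(baseDirWithSeparator, "unmerged")

        Console.WriteLine(a)
        Console.WriteLine(a = b)   ' True: both calls give the same path
    End Sub

End Module
```

With plain string concatenation, as in the question, the output path is only correct when the folder string happens to end with a backslash.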

To accumulate all the pages of a type into one file, you would have to detect when the type changes and then use the ExtractPageRange method. I don't have Bytescout.PDFExtractor or the example PDF, so I can't try it out.
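
The "detect when the type changes" step can be sketched without the PDF library at all. The following is an untested-against-Bytescout sketch, not the answer's own code: the names Segment and GroupPages are hypothetical, and the input is assumed to be a list with one entry per page holding the type name found on that page, or Nothing when the page has no barcode.

```vbnet
Option Infer On
Option Strict On

Imports System
Imports System.Collections.Generic

Module SegmentSketch

    ' One run of consecutive pages that belong together (1-based, inclusive).
    Class Segment
        Property TypeName As String
        Property FirstPage As Integer
        Property LastPage As Integer
    End Class

    ' detectedTypes(i) is the type name found on page i (0-based),
    ' or Nothing when that page carries no barcode.
    Function GroupPages(detectedTypes As IList(Of String)) As List(Of Segment)
        Dim segments As New List(Of Segment)
        Dim current As Segment = Nothing

        For i = 0 To detectedTypes.Count - 1
            Dim pageNumber = i + 1
            Dim found = detectedTypes(i)

            If found IsNot Nothing OrElse current Is Nothing Then
                ' A barcode page (or the very first page) starts a new segment.
                current = New Segment With {
                    .TypeName = If(found, "UNKNOWN"),
                    .FirstPage = pageNumber,
                    .LastPage = pageNumber
                }
                segments.Add(current)
            Else
                ' No barcode: the page extends the segment in progress.
                current.LastPage = pageNumber
            End If
        Next

        Return segments
    End Function

    Sub Main()
        ' Hypothetical scan result for a six-page file.
        Dim detected = New String() {"Cover Sheet", Nothing, "Complaint", Nothing, Nothing, "Exhibit"}

        For Each s In GroupPages(detected)
            Console.WriteLine(s.TypeName & ": pages " & s.FirstPage & "-" & s.LastPage)
        Next
    End Sub

End Module
```

For the sample input this yields three segments: pages 1-2, 3-5 and 6-6. Each segment would then correspond to one ExtractPageRange call and one output file. The exact arguments of that method are not shown in this thread, so check them in the Bytescout documentation. Note that in this sketch every barcode page starts a new file, even if it repeats the previous type; compare TypeName with the current segment instead if repeated barcodes should be merged.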
