Excel VBA可以搜索PDF中的文本并提取和命名页 [英] Excel VBA to Search for Text in PDF and Extract and Name Pages

查看:691
本文介绍了Excel VBA可以搜索PDF中的文本并提取和命名页的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有以下代码,该代码查看电子表格的A列中的每个单元格,在指定的PDF中搜索它在那里找到的文本,然后提取找到该文本为PDF的页面,并用电子表格单元格中的值.该代码有效,但速度很慢,我可能需要在PDF中搜索多达200个单词,而这些单词可能长达600页.有没有办法使代码更快?当前,它在每个单元格中循环搜索,在每个页面中循环搜索每个单词,直到在该单元格中找到单词为止.

I have the following code, which looks at each cell in column A of my spreadsheet, searches for the text it finds there in the specified PDF and then extracts the page where it finds the text as a PDF, naming it with the value in the cell of the spreadsheet. The code works but is rather slow, I may need to search for as many as 200 words in a PDF which could be as long as 600 pages. Is there a way to make the code faster? Currently it loops through each cell searches through each page looping through each word until it finds the word in the cell.

    Sub test_with_PDF()

    Dim objApp As Object
    Dim objPDDoc As Object
    Dim objjso As Object
    Dim wordsCount As Long
    Dim page As Long
    Dim i As Long
    Dim strData As String
    Dim strFileName As String
    Dim lastrow As Long, c As Range
    Dim PageNos As Integer
    Dim newPDF As Acrobat.CAcroPDDoc
    Dim NewName As String
    Dim Folder As String
    lastrow = Sheets("Sheet1").Cells(Rows.Count, "A").End(xlUp).Row

    strFileName = selectFile()
    Folder = GetFolder()

    Set objApp = CreateObject("AcroExch.App")
    Set objPDDoc = CreateObject("AcroExch.PDDoc")
    'AD.1 open file, if =false file is damage
    If objPDDoc.Open(strFileName) Then
        Set objjso = objPDDoc.GetJSObject

 PageNos = 0
 For Each c In Sheets("Sheet1").Range("A2:A" & lastrow)

        For page = 0 To objPDDoc.GetNumPages - 1
            wordsCount = objjso.GetPageNumWords(page)
            For i = 0 To wordsCount

                If InStr(1, c.Value, ", ") = 0 Then

                    If objjso.getPageNthWord(page, i) = c.Value Then
                        PageNos = PageNos + 1
                        If FileExist(Folder & "\" & c.Offset(0, 4) & ".pdf") Then

                                Set newPDF = CreateObject("AcroExch.pdDoc")
                                NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
                                newPDF.Open (NewName)
                                newPDF.InsertPages lngPages, objPDDoc, page, 1, 0
                                newPDF.Save 1, NewName
                                newPDF.Close
                                Set newPDF = Nothing
                                Exit For
                         Else
                                Set newPDF = CreateObject("AcroExch.PDDoc")
                                newPDF.Create
                                NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
                                newPDF.InsertPages -1, objPDDoc, page, 1, 0
                                newPDF.Save 1, NewName
                                newPDF.Close
                                Set newPDF = Nothing
                                Exit For

                        End If
                    End If
                Else

                If objjso.getPageNthWord(page, i) = c.Offset(0, 1).Value Then
                    If objjso.getPageNthWord(page, i + 1) = c.Offset(0, 2).Value Then
                        PageNos = PageNos + 1
                         If FileExist(Folder & "\" & c.Offset(0, 4) & ".pdf") Then

                                Set newPDF = CreateObject("AcroExch.pdDoc")
                                NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
                                newPDF.Open (NewName)
                                newPDF.InsertPages lngPages, objPDDoc, page, 1, 0
                                newPDF.Save 1, NewName
                                newPDF.Close
                                Set newPDF = Nothing
                                Exit For
                         Else
                                Set newPDF = CreateObject("AcroExch.PDDoc")
                                newPDF.Create
                                NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
                                newPDF.InsertPages -1, objPDDoc, page, 1, 0
                                newPDF.Save 1, NewName
                                newPDF.Close
                                Set newPDF = Nothing
                                Exit For

                        End If
                        Exit For
                    End If
                End If
            End If
            Next i
        Next page
        c.Offset(0, 3).Value = PageNos
        PageNos = 0
    Next c
    MsgBox "Done"
    Else
        MsgBox "error!"
    End If
End Sub

Function FileExist(path As String) As Boolean
    If Dir(path) <> vbNullString Then FileExist = True
End Function
Function selectFile()
Dim fd As FileDialog, fileName As String

On Error GoTo ErrorHandler

Set fd = Application.FileDialog(msoFileDialogFilePicker)

fd.AllowMultiSelect = False

If fd.Show = True Then
    If fd.SelectedItems(1) <> vbNullString Then
        fileName = fd.SelectedItems(1)
    End If
Else
    'Exit code if no file is selected
    End
End If

'Return Selected FileName
selectFile = fileName

Set fd = Nothing

Exit Function

ErrorHandler:
Set fd = Nothing
MsgBox "Error " & Err & ": " & Error(Err)

End Function
Function GetFolder() As String
    Dim fldr As FileDialog
    Dim sItem As String
    Set fldr = Application.FileDialog(msoFileDialogFolderPicker)
    With fldr
        .Title = "Select the Folder where you want you new PDFs to go"
        .AllowMultiSelect = False
        .InitialFileName = Application.DefaultFilePath
        If .Show <> -1 Then GoTo NextCode
        sItem = .SelectedItems(1)
    End With
NextCode:
    GetFolder = sItem
    Set fldr = Nothing
End Function

非常感谢.

推荐答案

对于某些事情,循环绝对是出色的选择,但可以将处理与这些较高的查询联系在一起.最近,我和一位同事正在执行类似的任务(虽然与pdf无关),但是我们使用range.find方法代替在每个单元格上执行instr的循环取得了很大的成功.

Loops are definitely excellent for some things, but can tie down processing with these higher queries. Recently, a colleague and I were doing a similar task (not pdf-related though), and we had much success with using a range.find method instead of a loop executing instr on each cell.

一些景点: -为模仿使用.find方法时的循环单元"功能,我们以.cells结尾我们的range语句,如下所示:

Some points of interest: -To mimic the "loop cells" functionality when using the .find method, we ended our range statement with .cells, as seen below:

activesheet.usedrange.cells.find()

activesheet.usedrange.cells.find( )

所需的字符串在()中.

Where the desired string goes within the ( ).

-返回值:一个Range对象,表示找到该信息的第一个单元格."

-The return value: "A Range object that represents the first cell where that information is found."

.find方法返回范围后,随后的子例程可以提取页码和文档名称.

Once the .find method returns a range, a subsequent subroutine can extract the page number and document name.

-如果需要查找事件的第n个实例,可以使用FindNext和FindPrevious方法重复搜索." (微软)

-If you need to find the nth instance of an occurrence, "You can use the FindNext andFindPrevious methods to repeat the search." (Microsoft)

Microsoft range.find概述: https://msdn.microsoft. com/en-us/vba/excel-vba/articles/range-find-method-excel

Microsoft overview of range.find: https://msdn.microsoft.com/en-us/vba/excel-vba/articles/range-find-method-excel

因此,使用这种方法,用户可以使用基于列表中单元格数量的循环对每个字符串执行.find方法.

So with this approach, the user can use a loop based on a count of cells in your list to execute the .find method for each string.

缺点是(我认为)这必须在excel应用程序中的文本上完成;另外,我还没有测试过它是否可以确定字符串是否必须独自驻留在单元格中(我不认为这是个问题).

Downside is (I assume) that this must be done on text within the excel application; also, I’ve not tested it to determine if the string has to inhabit the cell by itself (I don’t think this is a concern).

‘==================

‘===================

另一个可能有益的建议是,首先批量复制.pdf中的所有文本,并尽可能减少循环(在文档对象级别上的直接操作).然后,您的查找/返回方法可以应用于批量文本.

Another suggestion that might be beneficial is to first bulk-rip all text from the .pdf with as little looping as possible (direct actions at the document object level). Then your find/return approach can be applied to the bulk text.

从教授的PowerPoint创建学习笔记时,我进行了类似的活动;我将所有文本抓到一个.txt文件中,然后返回包含字符串列表实例的每个句子.

I did a similar activity when creating study notes from a professor’s PowerPoints; I grabbed all the text into a .txt file, then returned every sentence containing the instance of a list of strings.

‘===================

‘=====================

一些警告:我承认我没有按照您的项目的规模执行解析,所以我的建议在实践中可能没有优势.

A few caveats: I admit that I have not executed parsing at the sheer size of your project, so my suggestions might not be advantageous in practice.

此外,我在解析.pdf文档方面还没有做很多工作,因为我尝试首先选择.txt/excel应用程序,然后使用它.

Also, I have not done much work parsing .pdf documents, as I try to opt for anything that is .txt/excel app first, and engage it instead.

祝您好运;我希望我至少能够提供思想上的帮助!

Good luck in your endeavors; I hope I was able to at least provide food for thought!

这篇关于Excel VBA可以搜索PDF中的文本并提取和命名页的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆