Excel VBA可以搜索PDF中的文本并提取和命名页 [英] Excel VBA to Search for Text in PDF and Extract and Name Pages
问题描述
我有以下代码,该代码查看电子表格的A列中的每个单元格,在指定的PDF中搜索它在那里找到的文本,然后提取找到该文本为PDF的页面,并用电子表格单元格中的值.该代码有效,但速度很慢,我可能需要在PDF中搜索多达200个单词,而这些单词可能长达600页.有没有办法使代码更快?当前,它在每个单元格中循环搜索,在每个页面中循环搜索每个单词,直到在该单元格中找到单词为止.
I have the following code, which looks at each cell in column A of my spreadsheet, searches for the text it finds there in the specified PDF and then extracts the page where it finds the text as a PDF, naming it with the value in the cell of the spreadsheet. The code works but is rather slow, I may need to search for as many as 200 words in a PDF which could be as long as 600 pages. Is there a way to make the code faster? Currently it loops through each cell searches through each page looping through each word until it finds the word in the cell.
Sub test_with_PDF()
Dim objApp As Object
Dim objPDDoc As Object
Dim objjso As Object
Dim wordsCount As Long
Dim page As Long
Dim i As Long
Dim strData As String
Dim strFileName As String
Dim lastrow As Long, c As Range
Dim PageNos As Integer
Dim newPDF As Acrobat.CAcroPDDoc
Dim NewName As String
Dim Folder As String
lastrow = Sheets("Sheet1").Cells(Rows.Count, "A").End(xlUp).Row
strFileName = selectFile()
Folder = GetFolder()
Set objApp = CreateObject("AcroExch.App")
Set objPDDoc = CreateObject("AcroExch.PDDoc")
'AD.1 open file, if =false file is damage
If objPDDoc.Open(strFileName) Then
Set objjso = objPDDoc.GetJSObject
PageNos = 0
For Each c In Sheets("Sheet1").Range("A2:A" & lastrow)
For page = 0 To objPDDoc.GetNumPages - 1
wordsCount = objjso.GetPageNumWords(page)
For i = 0 To wordsCount
If InStr(1, c.Value, ", ") = 0 Then
If objjso.getPageNthWord(page, i) = c.Value Then
PageNos = PageNos + 1
If FileExist(Folder & "\" & c.Offset(0, 4) & ".pdf") Then
Set newPDF = CreateObject("AcroExch.pdDoc")
NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
newPDF.Open (NewName)
newPDF.InsertPages lngPages, objPDDoc, page, 1, 0
newPDF.Save 1, NewName
newPDF.Close
Set newPDF = Nothing
Exit For
Else
Set newPDF = CreateObject("AcroExch.PDDoc")
newPDF.Create
NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
newPDF.InsertPages -1, objPDDoc, page, 1, 0
newPDF.Save 1, NewName
newPDF.Close
Set newPDF = Nothing
Exit For
End If
End If
Else
If objjso.getPageNthWord(page, i) = c.Offset(0, 1).Value Then
If objjso.getPageNthWord(page, i + 1) = c.Offset(0, 2).Value Then
PageNos = PageNos + 1
If FileExist(Folder & "\" & c.Offset(0, 4) & ".pdf") Then
Set newPDF = CreateObject("AcroExch.pdDoc")
NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
newPDF.Open (NewName)
newPDF.InsertPages lngPages, objPDDoc, page, 1, 0
newPDF.Save 1, NewName
newPDF.Close
Set newPDF = Nothing
Exit For
Else
Set newPDF = CreateObject("AcroExch.PDDoc")
newPDF.Create
NewName = Folder & "\" & c.Offset(0, 4) & ".pdf"
newPDF.InsertPages -1, objPDDoc, page, 1, 0
newPDF.Save 1, NewName
newPDF.Close
Set newPDF = Nothing
Exit For
End If
Exit For
End If
End If
End If
Next i
Next page
c.Offset(0, 3).Value = PageNos
PageNos = 0
Next c
MsgBox "Done"
Else
MsgBox "error!"
End If
End Sub
Function FileExist(path As String) As Boolean
If Dir(path) <> vbNullString Then FileExist = True
End Function
Function selectFile()
Dim fd As FileDialog, fileName As String
On Error GoTo ErrorHandler
Set fd = Application.FileDialog(msoFileDialogFilePicker)
fd.AllowMultiSelect = False
If fd.Show = True Then
If fd.SelectedItems(1) <> vbNullString Then
fileName = fd.SelectedItems(1)
End If
Else
'Exit code if no file is selected
End
End If
'Return Selected FileName
selectFile = fileName
Set fd = Nothing
Exit Function
ErrorHandler:
Set fd = Nothing
MsgBox "Error " & Err & ": " & Error(Err)
End Function
Function GetFolder() As String
Dim fldr As FileDialog
Dim sItem As String
Set fldr = Application.FileDialog(msoFileDialogFolderPicker)
With fldr
.Title = "Select the Folder where you want you new PDFs to go"
.AllowMultiSelect = False
.InitialFileName = Application.DefaultFilePath
If .Show <> -1 Then GoTo NextCode
sItem = .SelectedItems(1)
End With
NextCode:
GetFolder = sItem
Set fldr = Nothing
End Function
非常感谢.
推荐答案
对于某些事情,循环绝对是出色的选择,但可以将处理与这些较高的查询联系在一起.最近,我和一位同事正在执行类似的任务(虽然与pdf无关),但是我们使用range.find方法代替在每个单元格上执行instr的循环取得了很大的成功.
Loops are definitely excellent for some things, but can tie down processing with these higher queries. Recently, a colleague and I were doing a similar task (not pdf-related though), and we had much success with using a range.find method instead of a loop executing instr on each cell.
一些景点: -为模仿使用.find方法时的循环单元"功能,我们以.cells结尾我们的range语句,如下所示:
Some points of interest: -To mimic the "loop cells" functionality when using the .find method, we ended our range statement with .cells, as seen below:
activesheet.usedrange.cells.find()
activesheet.usedrange.cells.find( )
所需的字符串在()中.
Where the desired string goes within the ( ).
-返回值:一个Range对象,表示找到该信息的第一个单元格."
-The return value: "A Range object that represents the first cell where that information is found."
.find方法返回范围后,随后的子例程可以提取页码和文档名称.
Once the .find method returns a range, a subsequent subroutine can extract the page number and document name.
-如果需要查找事件的第n个实例,可以使用FindNext和FindPrevious方法重复搜索." (微软)
-If you need to find the nth instance of an occurrence, "You can use the FindNext andFindPrevious methods to repeat the search." (Microsoft)
Microsoft range.find概述: https://msdn.microsoft. com/en-us/vba/excel-vba/articles/range-find-method-excel
Microsoft overview of range.find: https://msdn.microsoft.com/en-us/vba/excel-vba/articles/range-find-method-excel
因此,使用这种方法,用户可以使用基于列表中单元格数量的循环对每个字符串执行.find方法.
So with this approach, the user can use a loop based on a count of cells in your list to execute the .find method for each string.
缺点是(我认为)这必须在excel应用程序中的文本上完成;另外,我还没有测试过它是否可以确定字符串是否必须独自驻留在单元格中(我不认为这是个问题).
Downside is (I assume) that this must be done on text within the excel application; also, I’ve not tested it to determine if the string has to inhabit the cell by itself (I don’t think this is a concern).
‘==================
‘===================
另一个可能有益的建议是,首先批量复制.pdf中的所有文本,并尽可能减少循环(在文档对象级别上的直接操作).然后,您的查找/返回方法可以应用于批量文本.
Another suggestion that might be beneficial is to first bulk-rip all text from the .pdf with as little looping as possible (direct actions at the document object level). Then your find/return approach can be applied to the bulk text.
从教授的PowerPoint创建学习笔记时,我进行了类似的活动;我将所有文本抓到一个.txt文件中,然后返回包含字符串列表实例的每个句子.
I did a similar activity when creating study notes from a professor’s PowerPoints; I grabbed all the text into a .txt file, then returned every sentence containing the instance of a list of strings.
‘===================
‘=====================
一些警告:我承认我没有按照您的项目的规模执行解析,所以我的建议在实践中可能没有优势.
A few caveats: I admit that I have not executed parsing at the sheer size of your project, so my suggestions might not be advantageous in practice.
此外,我在解析.pdf文档方面还没有做很多工作,因为我尝试首先选择.txt/excel应用程序,然后使用它.
Also, I have not done much work parsing .pdf documents, as I try to opt for anything that is .txt/excel app first, and engage it instead.
祝您好运;我希望我至少能够提供思想上的帮助!
Good luck in your endeavors; I hope I was able to at least provide food for thought!
这篇关于Excel VBA可以搜索PDF中的文本并提取和命名页的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!