VBA Word的重复删除无法有效运行 [英] Duplicate removal for VBA Word not working effectively
问题描述
我有一个程序可以删除重复项,并且一切正常.只是冻结了大数据集,即1到250万个字.
I have a program to remove duplicates and everything is working properly. It is just freezing with large data sets i.e. 1 to 2.5 million words.
我的方法有什么问题?有更好的吗?
What is wrong with my approach? Is there a better one?
Sub DeleteDuplicateParagraphs()
Dim p1 As Paragraph
Dim p2 As Paragraph
Dim DupCount As Long
Dim StartTime As Double
Dim SecondsElapsed As Double
StartTime = Timer
For Each p1 In ActiveDocument.Paragraphs
If p1.range.Text <> vbCr Then
For Each p2 In ActiveDocument.Paragraphs
If p1.range.Text = p2.range.Text Then
DupCount = DupCount + 1
If p1.range.Text = p2.range.Text And DupCount > 1 Then p2.range.Delete
End If
Next p2
End If
DupCount = 0
Next p1
SecondsElapsed = Round(Timer - StartTime, 2)
MsgBox "This code ran successfully in " & SecondsElapsed & " seconds", vbInformation
DupCount = 0
End Sub
推荐答案
尝试一下(首先将对Microsoft脚本运行时的引用添加到您的VBA项目中):
Try this (first add a reference to the Microsoft Scripting Runtime to your VBA project):
Sub DeleteDuplicateParagraphs()
Dim p As Paragraph
Dim d As New Scripting.Dictionary
Dim t As Variant
Dim i As Integer
Dim StartTime As Single
StartTime = Timer
' collect duplicates
For Each p In ActiveDocument.Paragraphs
t = p.Range.Text
If t <> vbCr Then
If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
d(t).Add d(t).Count + 1, p
End If
Next
' eliminate duplicates
Application.ScreenUpdating = False
For Each t In d
For i = 2 To d(t).Count
d(t)(i).Range.Delete
Next
Next
Application.ScreenUpdating = True
MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub
这利用了Scripting.Dictionary
是一个哈希表的事实,该哈希表旨在非常快速地将唯一键与值相关联.因此,它非常适合发现重复的密钥.字典键必须是字符串,我们可以方便地使用段落文本.
This makes use of the fact that the Scripting.Dictionary
is a hash table that is geared towards very quickly associating unique keys with values. It is therefore very good at spotting duplicate keys. Dictionary keys have to be strings, conveniently we can use the paragraph texts for that.
对于值,我们使用更多的字典对象,仅因为它们比VBA的数组好得多.在其中,我们以相同的文本收集对实际段落实例的引用.
For values we use more dictionary objects, solely for the fact that they work a lot better than VBA's arrays. In them we collect the references to the actual paragraph instances with the same text.
实际上,删除重复的段落是一件很简单的事.
Actually deleting duplicate paragraphs is a very simple matter afterwards.
注意:上面代码中的重复检测部分非常快.但是,如果Word在大型文档中变得无响应,则它位于重复删除部分,即由于Word的撤消缓冲区.
Note: The duplicate detection part in the above code is very fast. However, if Word becomes unresponsive in large documents then it's in the duplicate removal part, namely because of Word's undo buffer.
罪魁祸首是段落范围被一个接一个地删除,导致Word建立一个很大的撤消缓冲区.不幸的是,(我知道)没有办法
The culprit is that the paragraph ranges are deleted one after another, causing Word to build a very large undo buffer. Unfortunately there is no way (that I know of) to either
- 一步删除多个单独的范围(这只会导致撤消缓冲区中只有一个条目),或者
- 完全从VBA禁用撤消缓冲区
在消除重复项"循环中定期调用UndoClear
可能会有所帮助,禁用ScreenUpdating
也不是一个坏主意:
Calling UndoClear
periodically in the "eliminate duplicates" loop might help, disabling ScreenUpdating
is also not a bad idea:
' eliminate duplicates
Dim x As Integer
Application.ScreenUpdating = False
For Each t In d
x = x + 1
For i = 2 To d(t).Count
d(t)(i).Range.Delete
Next
If x Mod 50 = 0 Then ActiveDocument.UndoClear
Next
ActiveDocument.UndoClear
Application.ScreenUpdating = True
这篇关于VBA Word的重复删除无法有效运行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!