正则表达式Microsoft Word,而不会破坏文档格式 [英] Regex Microsoft Word without destroying document formatting
问题描述
众所周知,单词的查找和替换通配符"功能受到一些严重的限制.
以下代码在word文档中实现了真正的正则表达式查找和替换,并且在其他Stackoverflow和SuperUser问题中也可以找到其正则表达式.
Sub RegEx_PlainText(Before As String, After As String)
Dim regexp As Object
Set regexp = CreateObject("vbscript.regexp")
With regexp
.Pattern = Before
.IgnoreCase = True
.Global = True
'could be any Range , .Range.Text , or selection object
ActiveDocument.Range = .Replace(ActiveDocument.Range, After)
End With
End Sub
但是,这会擦除所有格式的文档.
即使字符串长度相同或实际上是相同的字符串,Word也不会保留每个字符的格式设置,因此ActiveDocument.Range = ActiveDocument.Range
或Selection.Text=Selection.Text
将擦除所有格式设置(或更准确地说,将整个范围的格式设置为与范围内的第一个字符,并添加回车符).一经思考,这种行为就不足为奇了.
要解决此问题,以下代码将运行正则表达式查找,然后在匹配项中循环并仅在找到匹配项的范围内运行.replace
.然后,仅当出现以下情况时,格式才会丢失匹配的iself具有多种格式(例如,斜体字会丢失)
希望代码注释使此操作非常透明.
Sub RegEx(Before As String, After As String, _
Optional CaseSensitive As Boolean = False, _
Optional Location As Range = Nothing, _
Optional DebugMode As Boolean = False)
'can't declare activedocument.range in parameters
If Location Is Nothing Then Set Location = ActiveDocument.Range
Dim regexp As Object
Dim Foundmatches As Object
Dim Match As Object
Dim MatchRange As Range
Dim offset As Integer: offset = 0
Set regexp = CreateObject("vbscript.regexp")
With regexp
.Pattern = Before
.IgnoreCase = Not CaseSensitive
.Global = True
'set foundmatches to collection of all regex matches
Set Foundmatches = .Execute(Location.text)
For Each Match In Foundmatches
'set matchrange to location of found string in source doc.
'offset accounts for change in length of document from already completed replacements
Set MatchRange = Location.Document _
.Range(Match.FirstIndex + offset, _
Match.FirstIndex + Match.Length + offset)
'debugging
If DebugMode Then
Debug.Print "strfound = " & Match.Value
Debug.Print "matchpoint = " & Match.FirstIndex
Debug.Print "origstrlength = " & Match.Length
Debug.Print "offset = " & offset
Debug.Print "matchrange = " & MatchRange.text
MatchRange.Select
Stop
Else
'REAL LIFE
'run the regex replace just on the range containing the regex match
MatchRange = .Replace(MatchRange, After)
'increment offset to account for change in length of document
offset = offset + MatchRange.End - MatchRange.Start - Match.Length
End If
Next
End With
End Sub
这适用于简单文档,但是当我在真实文档上运行它时,matchrange
最终会出现在找到匹配项的某个位置,但并不完全正确.可以预测,有时它在右边,有时在左边.通常,文档越复杂. (链接,上下文表,格式等),最终导致的错误越多.
有人知道为什么这种方法不起作用以及如何解决吗?如果我能理解为什么这种方法不起作用,那么我也许能够确定这种方法是否可以解决,或者如果我只需要尝试其他方法.
代码包含DebugMode参数,这意味着它将仅遍历文档并突出显示所有匹配项,而不会执行任何更改.还将输出一些内容到控制台.这对足够善良地与我一起解决这个问题的人都应该有所帮助.
https://calibre-ebook.com/downloads/demos/demo.docx 这是一个示例文档(不是由我制作的),可能会有用.
@Some_Guy:感谢您提出这个问题,我遇到了类似的问题,您的帖子为我节省了很多时间.
这是我想出的缺点:
Sub RegEx(Before As String, After As String, _
Optional CaseSensitive As Boolean = False, _
Optional Location As Range = Nothing, _
Optional DebugMode As Boolean = False)
'can't declare activedocument.range in parameters
If Location Is Nothing Then Set Location = ActiveDocument.Range
Dim j As Long
Dim regexp As Object
Dim Foundmatches As Object
Dim Match As Object
Dim MatchRange As Range
Dim offset As Integer: offset = 0
Set regexp = CreateObject("vbscript.regexp")
With regexp
.Pattern = Before
.IgnoreCase = Not CaseSensitive
.Global = True
'set foundmatches to collection of all regex matches
Set Foundmatches = .Execute(Location.Text)
For j = Foundmatches.Count - 1 To 0 Step -1
If DebugMode = True Then
'debugging
Debug.Print Foundmatches(j), .Replace(Foundmatches(j), After)
Else
'REAL LIFE
'run a plain old find/replace on the found string and eplace strings
With ActiveDocument.Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Hidden = True
.Text = Foundmatches(j)
.Replacement.Text = regexp.Replace(Foundmatches(j), After)
.Execute Replace:=wdReplaceAll
End With
End If
Next j
End With
End Sub
基本上,我使用一个简单的查找/替换字符串,该字符串与用正则表达式匹配找到的每个项目(并将被替换)匹配,并且在Word中会提供对它的体面支持).请注意,任何替换的文本均采用第一个替换字符的格式,因此,如果第一个单词为粗体,则所有替换的文本均将为粗体.
It's well known that word's find and replace "wildcards" features suffer some severe limitations.
The following code implements true regex find and replace in a word document, and variations on it are found in other Stackoverflow and SuperUser questions.
Sub RegEx_PlainText(Before As String, After As String)
Dim regexp As Object
Set regexp = CreateObject("vbscript.regexp")
With regexp
.Pattern = Before
.IgnoreCase = True
.Global = True
'could be any Range , .Range.Text , or selection object
ActiveDocument.Range = .Replace(ActiveDocument.Range, After)
End With
End Sub
However, this wipes the document of all formatting.
Word will not preserve formatting character by character even if the strings are of the same length or indeed the same string, so ActiveDocument.Range = ActiveDocument.Range
or Selection.Text=Selection.Text
will wipe all formatting (or more accurately, format the whole range the same as the first character in the range, and add a carriage return). Upon reflection, this behavior isn't so surprising.
To solve this, the following code runs a regex find, then loops through the matches and runs .replace
only on the range where the match is found. This then, would only lose formatting if the match iself had a variety of formatting (for example an italicised word would be lost)
Hopefully the code comments make this quite transparent.
Sub RegEx(Before As String, After As String, _
Optional CaseSensitive As Boolean = False, _
Optional Location As Range = Nothing, _
Optional DebugMode As Boolean = False)
'can't declare activedocument.range in parameters
If Location Is Nothing Then Set Location = ActiveDocument.Range
Dim regexp As Object
Dim Foundmatches As Object
Dim Match As Object
Dim MatchRange As Range
Dim offset As Integer: offset = 0
Set regexp = CreateObject("vbscript.regexp")
With regexp
.Pattern = Before
.IgnoreCase = Not CaseSensitive
.Global = True
'set foundmatches to collection of all regex matches
Set Foundmatches = .Execute(Location.text)
For Each Match In Foundmatches
'set matchrange to location of found string in source doc.
'offset accounts for change in length of document from already completed replacements
Set MatchRange = Location.Document _
.Range(Match.FirstIndex + offset, _
Match.FirstIndex + Match.Length + offset)
'debugging
If DebugMode Then
Debug.Print "strfound = " & Match.Value
Debug.Print "matchpoint = " & Match.FirstIndex
Debug.Print "origstrlength = " & Match.Length
Debug.Print "offset = " & offset
Debug.Print "matchrange = " & MatchRange.text
MatchRange.Select
Stop
Else
'REAL LIFE
'run the regex replace just on the range containing the regex match
MatchRange = .Replace(MatchRange, After)
'increment offset to account for change in length of document
offset = offset + MatchRange.End - MatchRange.Start - Match.Length
End If
Next
End With
End Sub
This works on simple documents, but when I run it on a real document, matchrange
ends up being at some point near the where the match was found, but not exactly right. It's not predictably off, sometimes it is to the right, and sometimes to the left. Generally the more complex the document. (links, tables of context, formatting etc.) the more wrong it ends up being.
Does anyone know why this doesn't work, and how to fix it? If I could understand why this isn't working, then I might be able to determine whether this approach can be fixed, or if I just need to try a different method.
Code includes DebugMode param which means it will just loop through the doc and highlight all matches, performing no changes. Also outputs a bunch of stuff to the console. This should be helpful for anyone kind enough to tackle this problem with me.
https://calibre-ebook.com/downloads/demos/demo.docx Here is a sample document (not produced by me) which may be useful.
@Some_Guy: thanks for asking this question, I had a similar problem and your post saved me quite a bit of time.
This is the kludge I came up with:
Sub RegEx(Before As String, After As String, _
Optional CaseSensitive As Boolean = False, _
Optional Location As Range = Nothing, _
Optional DebugMode As Boolean = False)
'can't declare activedocument.range in parameters
If Location Is Nothing Then Set Location = ActiveDocument.Range
Dim j As Long
Dim regexp As Object
Dim Foundmatches As Object
Dim Match As Object
Dim MatchRange As Range
Dim offset As Integer: offset = 0
Set regexp = CreateObject("vbscript.regexp")
With regexp
.Pattern = Before
.IgnoreCase = Not CaseSensitive
.Global = True
'set foundmatches to collection of all regex matches
Set Foundmatches = .Execute(Location.Text)
For j = Foundmatches.Count - 1 To 0 Step -1
If DebugMode = True Then
'debugging
Debug.Print Foundmatches(j), .Replace(Foundmatches(j), After)
Else
'REAL LIFE
'run a plain old find/replace on the found string and eplace strings
With ActiveDocument.Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Font.Hidden = True
.Text = Foundmatches(j)
.Replacement.Text = regexp.Replace(Foundmatches(j), After)
.Execute Replace:=wdReplaceAll
End With
End If
Next j
End With
End Sub
Basically I use a simple find/replace with strings that match each item found (and would be replaced) with a regex, would decent support for it exist in Word). Note that any text replaced takes on the formatting of the first replaced character, so if the first word is in bold, then all the replaced text will be bold.
这篇关于正则表达式Microsoft Word,而不会破坏文档格式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!