正则表达式Microsoft Word,而不会破坏文档格式 [英] Regex Microsoft Word without destroying document formatting

查看:110
本文介绍了正则表达式Microsoft Word,而不会破坏文档格式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

众所周知,单词的查找和替换通配符"功能受到一些严重的限制.

以下代码在word文档中实现了真正的正则表达式查找和替换,并且在其他Stackoverflow和SuperUser问题中也可以找到其正则表达式.

Sub RegEx_PlainText(Before As String, After As String)

    Dim regexp As Object
    Set regexp = CreateObject("vbscript.regexp")            

    With regexp
        .Pattern = Before
        .IgnoreCase = True
        .Global = True

         'could be any Range , .Range.Text , or selection object
         ActiveDocument.Range = .Replace(ActiveDocument.Range, After)

    End With
End Sub

但是,这会擦除所有格式的文档.

即使字符串长度相同或实际上是相同的字符串,Word也不会保留每个字符的格式设置,因此ActiveDocument.Range = ActiveDocument.RangeSelection.Text=Selection.Text将擦除所有格式设置(或更准确地说,将整个范围的格式设置为与范围内的第一个字符,并添加回车符).一经思考,这种行为就不足为奇了.

要解决此问题,以下代码将运行正则表达式查找,然后在匹配项中循环并仅在找到匹配项的范围内运行.replace.然后,仅当出现以下情况时,格式才会丢失匹配的iself具有多种格式(例如,斜体字会丢失)

希望代码注释使此操作非常透明.

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

   With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.text)

        For Each Match In Foundmatches

            'set matchrange to location of found string in source doc.
            'offset accounts for change in length of  document from already completed replacements
            Set MatchRange = Location.Document _
                   .Range(Match.FirstIndex + offset, _
                          Match.FirstIndex + Match.Length + offset)

            'debugging
            If DebugMode Then
                    Debug.Print "strfound      = " & Match.Value
                    Debug.Print "matchpoint    = " & Match.FirstIndex
                    Debug.Print "origstrlength = " & Match.Length
                    Debug.Print "offset        = " & offset
                    Debug.Print "matchrange    = " & MatchRange.text
                    MatchRange.Select
                Stop

            Else
            'REAL LIFE
                'run the regex replace just on the range containing the regex match
                MatchRange = .Replace(MatchRange, After)

                'increment offset to account for change in length of document
                offset = offset + MatchRange.End - MatchRange.Start - Match.Length
            End If
        Next
    End With
End Sub

这适用于简单文档,但是当我在真实文档上运行它时,matchrange最终会出现在找到匹配项的某个位置,但并不完全正确.可以预测,有时它在右边,有时在左边.通常,文档越复杂. (链接,上下文表,格式等),最终导致的错误越多.

有人知道为什么这种方法不起作用以及如何解决吗?如果我能理解为什么这种方法不起作用,那么我也许能够确定这种方法是否可以解决,或者如果我只需要尝试其他方法.

代码包含DebugMode参数,这意味着它将仅遍历文档并突出显示所有匹配项,而不会执行任何更改.还将输出一些内容到控制台.这对足够善良地与我一起解决这个问题的人都应该有所帮助.

https://calibre-ebook.com/downloads/demos/demo.docx 这是一个示例文档(不是由我制作的),可能会有用.

解决方案

@Some_Guy:感谢您提出这个问题,我遇到了类似的问题,您的帖子为我节省了很多时间.

这是我想出的缺点:

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim j As Long
    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

    With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.Text)
        For j = Foundmatches.Count - 1 To 0 Step -1

            If DebugMode = True Then
                'debugging
                Debug.Print Foundmatches(j), .Replace(Foundmatches(j), After)
            Else
                'REAL LIFE

                'run a plain old find/replace on the found string and eplace strings
                With ActiveDocument.Range.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Replacement.Font.Hidden = True
                    .Text = Foundmatches(j)
                    .Replacement.Text = regexp.Replace(Foundmatches(j), After)
                    .Execute Replace:=wdReplaceAll
                End With
            End If
        Next j
    End With
End Sub

基本上,我使用一个简单的查找/替换字符串,该字符串与用正则表达式匹配找到的每个项目(并将被替换)匹配,并且在Word中会提供对它的体面支持).请注意,任何替换的文本均采用第一个替换字符的格式,因此,如果第一个单词为粗体,则所有替换的文本均将为粗体.

It's well known that word's find and replace "wildcards" features suffer some severe limitations.

The following code implements true regex find and replace in a word document, and variations on it are found in other Stackoverflow and SuperUser questions.

Sub RegEx_PlainText(Before As String, After As String)

    Dim regexp As Object
    Set regexp = CreateObject("vbscript.regexp")            

    With regexp
        .Pattern = Before
        .IgnoreCase = True
        .Global = True

         'could be any Range , .Range.Text , or selection object
         ActiveDocument.Range = .Replace(ActiveDocument.Range, After)

    End With
End Sub

However, this wipes the document of all formatting.

Word will not preserve formatting character by character even if the strings are of the same length or indeed the same string, so ActiveDocument.Range = ActiveDocument.Range or Selection.Text=Selection.Text will wipe all formatting (or more accurately, format the whole range the same as the first character in the range, and add a carriage return). Upon reflection, this behavior isn't so surprising.

To solve this, the following code runs a regex find, then loops through the matches and runs .replace only on the range where the match is found. This then, would only lose formatting if the match iself had a variety of formatting (for example an italicised word would be lost)

Hopefully the code comments make this quite transparent.

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

   With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.text)

        For Each Match In Foundmatches

            'set matchrange to location of found string in source doc.
            'offset accounts for change in length of  document from already completed replacements
            Set MatchRange = Location.Document _
                   .Range(Match.FirstIndex + offset, _
                          Match.FirstIndex + Match.Length + offset)

            'debugging
            If DebugMode Then
                    Debug.Print "strfound      = " & Match.Value
                    Debug.Print "matchpoint    = " & Match.FirstIndex
                    Debug.Print "origstrlength = " & Match.Length
                    Debug.Print "offset        = " & offset
                    Debug.Print "matchrange    = " & MatchRange.text
                    MatchRange.Select
                Stop

            Else
            'REAL LIFE
                'run the regex replace just on the range containing the regex match
                MatchRange = .Replace(MatchRange, After)

                'increment offset to account for change in length of document
                offset = offset + MatchRange.End - MatchRange.Start - Match.Length
            End If
        Next
    End With
End Sub

This works on simple documents, but when I run it on a real document, matchrange ends up being at some point near the where the match was found, but not exactly right. It's not predictably off, sometimes it is to the right, and sometimes to the left. Generally the more complex the document. (links, tables of context, formatting etc.) the more wrong it ends up being.

Does anyone know why this doesn't work, and how to fix it? If I could understand why this isn't working, then I might be able to determine whether this approach can be fixed, or if I just need to try a different method.

Code includes DebugMode param which means it will just loop through the doc and highlight all matches, performing no changes. Also outputs a bunch of stuff to the console. This should be helpful for anyone kind enough to tackle this problem with me.

https://calibre-ebook.com/downloads/demos/demo.docx Here is a sample document (not produced by me) which may be useful.

解决方案

@Some_Guy: thanks for asking this question, I had a similar problem and your post saved me quite a bit of time.

This is the kludge I came up with:

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim j As Long
    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

    With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.Text)
        For j = Foundmatches.Count - 1 To 0 Step -1

            If DebugMode = True Then
                'debugging
                Debug.Print Foundmatches(j), .Replace(Foundmatches(j), After)
            Else
                'REAL LIFE

                'run a plain old find/replace on the found string and eplace strings
                With ActiveDocument.Range.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    .Replacement.Font.Hidden = True
                    .Text = Foundmatches(j)
                    .Replacement.Text = regexp.Replace(Foundmatches(j), After)
                    .Execute Replace:=wdReplaceAll
                End With
            End If
        Next j
    End With
End Sub

Basically I use a simple find/replace with strings that match each item found (and would be replaced) with a regex, would decent support for it exist in Word). Note that any text replaced takes on the formatting of the first replaced character, so if the first word is in bold, then all the replaced text will be bold.

这篇关于正则表达式Microsoft Word,而不会破坏文档格式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆