Regex in Microsoft Word without destroying document formatting

Question

It's well known that Word's find and replace "wildcards" feature suffers from some severe limitations.

The following code implements true regex find and replace in a Word document; variations on it appear in other Stack Overflow and Super User questions.

Sub RegEx_PlainText(Before As String, After As String)

    Dim regexp As Object
    Set regexp = CreateObject("vbscript.regexp")            

    With regexp
        .Pattern = Before
        .IgnoreCase = True
        .Global = True

         'could be any Range, Range.Text, or Selection object
         '(.Text is the default property, written out here for clarity)
         ActiveDocument.Range.Text = .Replace(ActiveDocument.Range.Text, After)

    End With
End Sub

However, this wipes the document of all formatting.

Word does not preserve formatting character by character, even when the new string has the same length or is the very same string. So ActiveDocument.Range = ActiveDocument.Range or Selection.Text = Selection.Text will wipe all formatting (or, more accurately, format the whole range like its first character and add a carriage return). On reflection, this behavior isn't so surprising.
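The behavior described above can be reproduced in a scratch document. The following is a minimal sketch, not part of the original post: the procedure name DemoFormattingLoss and the sample text are made up for illustration, and it only prints what Word reports before and after the reassignment so you can compare the values yourself.

```vba
' Hypothetical demo: reassign a range's text to itself and inspect the formatting.
Sub DemoFormattingLoss()
    Dim doc As Document
    Set doc = Documents.Add                 ' scratch document, closed unsaved below

    doc.Range.Text = "plain bold"
    doc.Range(6, 10).Bold = True            ' characters 6..10 are the word "bold"

    Debug.Print "Before: bold = " & doc.Range(6, 10).Bold & _
                ", paragraphs = " & doc.Paragraphs.Count

    ' The assignment under test: same string in, same string out.
    doc.Range.Text = doc.Range.Text

    Debug.Print "After:  bold = " & doc.Range(6, 10).Bold & _
                ", paragraphs = " & doc.Paragraphs.Count

    doc.Close SaveChanges:=wdDoNotSaveChanges
End Sub
```

Run it from the VBA editor with the Immediate window open (Ctrl+G) and compare the two printed lines.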

To solve this, the following code runs a regex find, then loops through the matches and runs .Replace only on the range where each match is found. Formatting would then be lost only if the match itself contained mixed formatting (for example, an italicised word inside the match would lose its italics).

Hopefully the code comments make this quite transparent.

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Long: offset = 0   'Long rather than Integer, so large length changes cannot overflow
    Set regexp = CreateObject("vbscript.regexp")

   With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.text)

        For Each Match In Foundmatches

            'set matchrange to location of found string in source doc.
            'offset accounts for change in length of  document from already completed replacements
            Set MatchRange = Location.Document _
                   .Range(Match.FirstIndex + offset, _
                          Match.FirstIndex + Match.Length + offset)

            'debugging
            If DebugMode Then
                Debug.Print "strfound      = " & Match.Value
                Debug.Print "matchpoint    = " & Match.FirstIndex
                Debug.Print "origstrlength = " & Match.Length
                Debug.Print "offset        = " & offset
                Debug.Print "matchrange    = " & MatchRange.Text
                MatchRange.Select
                Stop

            Else
            'REAL LIFE
                'run the regex replace just on the range containing the regex match
                MatchRange.Text = .Replace(MatchRange.Text, After)

                'increment offset to account for change in length of document
                offset = offset + MatchRange.End - MatchRange.Start - Match.Length
            End If
        Next
    End With
End Sub

This works on simple documents, but when I run it on a real document, MatchRange ends up near where the match was found, but not exactly on it. It isn't predictably off: sometimes it lands to the right and sometimes to the left. Generally, the more complex the document (links, tables of contents, formatting, etc.), the further off it ends up.

Does anyone know why this doesn't work, and how to fix it? If I could understand why it isn't working, I might be able to determine whether this approach can be fixed or whether I need to try a different method.

The code includes a DebugMode parameter. When it is set, the code just loops through the document and selects (highlights) each match in turn, making no changes, and prints a handful of values to the Immediate window. This should be helpful for anyone kind enough to tackle this problem with me.
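One further check, in the same spirit as DebugMode, is to compare the length of the string the regex searches with the number of character positions the range spans. This is a sketch under an assumption rather than a confirmed diagnosis: the assumption is that some content (field codes behind links and tables of contents, for instance) occupies positions in the range that do not appear in Range.Text, which would make string indexes drift from range positions. The procedure name CompareTextLengthToRange is hypothetical.

```vba
' Hypothetical diagnostic: do string indexes line up with range positions?
Sub CompareTextLengthToRange(Optional Location As Range = Nothing)
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim textLength As Long
    Dim rangeLength As Long

    textLength = Len(Location.Text)                 ' what the regex engine sees
    rangeLength = Location.End - Location.Start     ' what Range(start, end) addresses

    Debug.Print "Len(Location.Text)    = " & textLength
    Debug.Print "Location.End - .Start = " & rangeLength
    Debug.Print "Fields in range       = " & Location.Fields.Count

    If textLength <> rangeLength Then
        Debug.Print "String indexes and range positions differ by " & _
                    (rangeLength - textLength) & " over the whole range."
    End If
End Sub
```

If the two lengths differ, Match.FirstIndex cannot be used directly as a range position in that document, whatever the underlying cause turns out to be.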

Here is a sample document (not produced by me) which may be useful: https://calibre-ebook.com/downloads/demos/demo.docx

Recommended answer

@Some_Guy: thanks for asking this question. I had a similar problem, and your post saved me quite a bit of time.

Here is the mashup I came up with:

Sub RegEx(Before As String, After As String, _
          Optional CaseSensitive As Boolean = False, _
          Optional Location As Range = Nothing, _
          Optional DebugMode As Boolean = False)

    'can't declare activedocument.range in parameters
    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim j As Long
    Dim regexp As Object
    Dim Foundmatches As Object
    Dim Match As Object
    Dim MatchRange As Range
    Dim offset As Integer: offset = 0
    Set regexp = CreateObject("vbscript.regexp")

    With regexp
        .Pattern = Before
        .IgnoreCase = Not CaseSensitive
        .Global = True

        'set foundmatches to collection of all regex matches
        Set Foundmatches = .Execute(Location.Text)
        For j = Foundmatches.Count - 1 To 0 Step -1

            If DebugMode Then
                'debugging: print each match next to its replacement
                Debug.Print Foundmatches(j).Value, .Replace(Foundmatches(j).Value, After)
            Else
                'REAL LIFE

                'run a plain old find/replace on the found and replacement strings
                With ActiveDocument.Range.Find
                    .ClearFormatting
                    .Replacement.ClearFormatting
                    'note: this line formats the replacement text as hidden;
                    'drop it if the replaced text should remain visible
                    .Replacement.Font.Hidden = True
                    .Text = Foundmatches(j).Value
                    .Replacement.Text = regexp.Replace(Foundmatches(j).Value, After)
                    .Execute Replace:=wdReplaceAll
                End With
            End If
        Next j
    End With
End Sub

Basically, I run a simple find/replace using strings that match each item the regex found (and would replace, if Word had decent regex support). Note that any replaced text takes on the formatting of the first replaced character, so if the first word is in bold, all of the replaced text will be bold.
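A variation on this idea is sketched below. It is not the answerer's code. The names RegExViaFind and EscapeForFind are hypothetical, and the changes are judgment calls: it searches only inside Location, skips match strings it has already replaced, matches case exactly, leaves the replacement's formatting alone, and doubles any caret, because Word's Find treats ^ as the prefix for special codes. It shares the limits of the approach above: each match is re-run through the regex in isolation, and every occurrence of the matched string is replaced, including occurrences the regex itself would not have matched in context.

```vba
' Hypothetical helper: make a literal string safe for Find.Text / Replacement.Text.
Function EscapeForFind(ByVal s As String) As String
    EscapeForFind = Replace(s, "^", "^^")    ' ^^ stands for a literal caret
End Function

Sub RegExViaFind(Before As String, After As String, _
                 Optional CaseSensitive As Boolean = False, _
                 Optional Location As Range = Nothing)

    If Location Is Nothing Then Set Location = ActiveDocument.Range

    Dim regexp As Object
    Dim Foundmatches As Object
    Dim seen As Object          ' match strings that were already replaced
    Dim found As String
    Dim j As Long

    Set regexp = CreateObject("vbscript.regexp")
    Set seen = CreateObject("Scripting.Dictionary")

    regexp.Pattern = Before
    regexp.IgnoreCase = Not CaseSensitive
    regexp.Global = True

    Set Foundmatches = regexp.Execute(Location.Text)

    For j = 0 To Foundmatches.Count - 1
        found = Foundmatches(j).Value

        ' Skip empty matches and strings handled on an earlier pass.
        If Len(found) > 0 And Not seen.Exists(found) Then
            seen.Add found, True

            ' Duplicate keeps Find from moving the caller's Location range.
            With Location.Duplicate.Find
                .ClearFormatting
                .Replacement.ClearFormatting
                .MatchWildcards = False
                .MatchCase = True            ' the string came verbatim from the document
                .Forward = True
                .Wrap = wdFindStop           ' stay inside Location
                .Text = EscapeForFind(found)
                .Replacement.Text = EscapeForFind(regexp.Replace(found, After))
                .Execute Replace:=wdReplaceAll
            End With
        End If
    Next j
End Sub
```

For example, RegExViaFind "colou?r", "COLOR" would replace each distinct spelling the regex finds, one Find pass per spelling. Because Find locates the text itself, no character-offset arithmetic is involved.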
