Open XML - find and replace multiple placeholders in document template


Question

I know there are many posts on SO about this topic, but none seems to treat this particular issue. I'm trying to make a small generic document generator POC. I'm using Open XML.

The code is as follows:

private static void ReplacePlaceholders<T>(string templateDocumentPath, T templateObject)
    where T : class
{
    using (var templateDocument = WordprocessingDocument.Open(templateDocumentPath, true))
    {
        string templateDocumentText = null;
        using (var streamReader = new StreamReader(templateDocument.MainDocumentPart.GetStream()))
        {
            templateDocumentText = streamReader.ReadToEnd();
        }

        var props = templateObject.GetType().GetProperties();
        foreach (var prop in props)
        {
            var regexText = new Regex($"{prop.Name}");
            templateDocumentText =
                regexText.Replace(templateDocumentText, prop.GetValue(templateObject).ToString());
        }

        using var streamWriter = new StreamWriter(templateDocument.MainDocumentPart.GetStream(FileMode.Create));
        streamWriter.Write(templateDocumentText);
    }
}

The code works as expected. The problem is as follows:

StreamReader.ReadToEnd() returns my placeholders split between tags, so my Replace method replaces only the words that are not split.

In this case, my code will search for the word "Firstname" but will find "irstname" instead, so it won't replace it.

Is there any way to scan the whole .docx word by word and replace them?

(edit) A partial solution / workaround I found: I noticed that you have to write the placeholder in the .docx in one go (without re-editing it). For example, if I write "firstname", then come back and change it to "Firstname", Word will split it into "F" and "irstname". Without editing, it will not be split.

Recommended answer

TLDR

In short, the solution to your problem is to use the OpenXmlRegex utility class of the Open-Xml-PowerTools, as demonstrated in the unit test further below.

Using Open XML, you can represent the same text in multiple ways. If Microsoft Word is involved in creating that Open XML markup, the edits made to produce that text will play an important part. This is because Word keeps track of which edits were made in which editing session. So, for example, the w:p (Paragraph) elements shown in the following extreme scenarios represent precisely the same text. And anything between those two examples is possible, so any real solution must be able to deal with that.

Extreme Scenario 1: Single w:r and w:t Element

The following markup is nice and simple:

<w:p>
  <w:r>
    <w:t>Firstname</w:t>
  </w:r>
</w:p>

Extreme Scenario 2: Single-Character w:r and w:t Elements

While you typically won't find the following markup, it represents the theoretical extreme in which each character has its own w:r and w:t element.

<w:p>
  <w:r><w:t>F</w:t></w:r>
  <w:r><w:t>i</w:t></w:r>
  <w:r><w:t>r</w:t></w:r>
  <w:r><w:t>s</w:t></w:r>
  <w:r><w:t>t</w:t></w:r>
  <w:r><w:t>n</w:t></w:r>
  <w:r><w:t>a</w:t></w:r>
  <w:r><w:t>m</w:t></w:r>
  <w:r><w:t>e</w:t></w:r>
</w:p>
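
To make the consequence concrete, here is a minimal sketch that uses only LINQ to XML (no Open XML SDK). The class name RunTextDemo and its helper methods are hypothetical and exist only for this illustration. It builds one paragraph with the placeholder in a single run and one with the placeholder split into "F" and "irstname", then shows that both carry the same text while a search over the serialized markup, as in the question, only finds the unsplit variant.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class RunTextDemo
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a w:p with one w:r/w:t pair per entry in runTexts.
    public static XElement BuildParagraph(params string[] runTexts) =>
        new XElement(W + "p",
            new XAttribute(XNamespace.Xmlns + "w", W.NamespaceName),
            runTexts.Select(text => new XElement(W + "r", new XElement(W + "t", text))));

    // Concatenates the values of all w:t elements in document order.
    public static string GetText(XElement paragraph) =>
        string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));

    // Mimics the question's approach: search the serialized markup as one string.
    public static bool RawMarkupContains(XElement paragraph, string placeholder) =>
        Regex.IsMatch(
            paragraph.ToString(SaveOptions.DisableFormatting),
            Regex.Escape(placeholder));

    public static void Main()
    {
        XElement single = BuildParagraph("Firstname");
        XElement split = BuildParagraph("F", "irstname");

        Console.WriteLine(GetText(single) == GetText(split));      // True
        Console.WriteLine(RawMarkupContains(single, "Firstname")); // True
        Console.WriteLine(RawMarkupContains(split, "Firstname"));  // False
    }
}
```

The last line is the behavior described in the question: the text is there, but the tags between "F" and "irstname" hide it from a plain string search.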

Why did I use this extreme example if it does not occur in practice, you might ask? The answer is that it plays an essential role in the solution in case you want to roll your own.

To do it right, you must:

  1. transform the runs (w:r) of your paragraph (w:p) into single-character runs (i.e., w:r elements with one single-character w:t or one w:sym each), retaining the run properties (w:rPr);
  2. perform the search-and-replace operation on those single-character runs (using a few other tricks); and
  3. considering the potentially different run properties (w:rPr) of the runs resulting from the search-and-replace action, transform such resulting runs back into the fewest number of "coalesced" runs required to represent the text and its formatting.
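
The three steps above can be sketched with plain LINQ to XML. This is a deliberately simplified illustration, not the Open-Xml-PowerTools implementation: the class RunReplaceSketch and its members are hypothetical, it only understands runs made of w:rPr and w:t, it treats every other run (such as a w:sym run) as an opaque single position, it assumes the paragraph's runs are direct children, and it ignores fields, content controls, and revision markup entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class RunReplaceSketch
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A "text run" here is a run that contains nothing but w:rPr and w:t.
    private static bool IsTextRun(XElement run) =>
        run.Elements().All(e => e.Name == W + "rPr" || e.Name == W + "t");

    private static string Text(XElement run) =>
        string.Concat(run.Elements(W + "t").Select(t => t.Value));

    // Creates a new run with a copy of the given run properties (if any).
    private static XElement MakeRun(XElement rPr, string text) =>
        new XElement(W + "r",
            rPr == null ? null : new XElement(rPr),
            new XElement(W + "t", new XAttribute(XNamespace.Xml + "space", "preserve"), text));

    public static void Replace(XElement paragraph, Regex regex, string replacement)
    {
        // Step 1: split each text run into single-character runs, keeping w:rPr.
        List<XElement> runs = paragraph.Elements(W + "r")
            .SelectMany(run => IsTextRun(run)
                ? Text(run).Select(c => MakeRun(run.Element(W + "rPr"), c.ToString()))
                : new[] { run })
            .ToList();

        // Step 2: one character per run, so string indexes equal list indexes.
        // Non-text runs are represented by U+FFFC and therefore never match text.
        string text = string.Concat(runs.Select(r => IsTextRun(r) ? Text(r) : "\uFFFC"));
        foreach (Match m in regex.Matches(text).Cast<Match>().Where(m => m.Length > 0).Reverse())
        {
            // The replacement inherits the formatting of the first matched character.
            XElement rPr = runs[m.Index].Element(W + "rPr");
            runs.RemoveRange(m.Index, m.Length);
            runs.Insert(m.Index, MakeRun(rPr, replacement));
        }

        // Step 3: coalesce adjacent text runs that have identical run properties.
        var merged = new List<XElement>();
        foreach (XElement run in runs)
        {
            XElement last = merged.LastOrDefault();
            if (last != null && IsTextRun(last) && IsTextRun(run) &&
                XNode.DeepEquals(last.Element(W + "rPr"), run.Element(W + "rPr")))
                last.Element(W + "t").Value += Text(run);
            else
                merged.Add(run);
        }

        paragraph.Elements(W + "r").Remove();
        paragraph.Add(merged);
    }

    public static void Main()
    {
        XElement paragraph = XElement.Parse(
            "<w:p xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>" +
            "<w:r><w:rPr><w:b/></w:rPr><w:t>F</w:t></w:r>" +
            "<w:r><w:rPr><w:b/></w:rPr><w:t>irstname</w:t></w:r>" +
            "<w:r><w:sym w:font='Wingdings' w:char='F04A'/></w:r></w:p>");

        Replace(paragraph, new Regex("Firstname"), "Bernie");
        Console.WriteLine(paragraph.Elements().Count());       // 2
        Console.WriteLine(paragraph.Elements().First().Value); // Bernie
    }
}
```

Matches are processed from last to first so that removing and inserting runs does not shift the indexes of matches that are still pending. Even at this size, the sketch leaves out most of what real documents need, which is the argument for using an existing implementation.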

When replacing text, you should not lose or alter the formatting of the text that is unaffected by your replacement. You should also not remove unaffected fields or content controls (w:sdt). Ah, and by the way, don't forget revision markup such as w:ins and w:del ...

The good news is that you don't have to roll your own. The OpenXmlRegex utility class of Eric White's Open-Xml-PowerTools implements the above algorithm (and more). I've successfully used it in large-scale RFP and contracting scenarios and also contributed back to it.

In this section, I'm going to demonstrate how to use the Open-Xml-PowerTools to replace the placeholder text "Firstname" (as in the question) with various first names (using "Bernie" in the sample output document).

Let's first look at the following sample document, which is created by the unit test shown a little later. Note that we have formatted runs and a symbol. As in the question, the placeholder "Firstname" is split into two runs, i.e., "F" and "irstname".

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>F</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>irstname</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Desired Output Document

The following is the document resulting from replacing "Firstname" with "Bernie" if you do it right. Note that the formatting is retained and that we did not lose our symbol.

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>Bernie</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Sample Usage

Next, here's a full unit test that demonstrates how to use the OpenXmlRegex.Replace() method, noting that the example only shows one of the multiple overloads. The unit test also demonstrates that this works:

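  • regardless of how the placeholder (e.g., "Firstname") is split across one or more runs;
  • while retaining the formatting of the placeholder;
  • without losing the formatting of other runs; and
  • without losing symbols (or any other markup such as fields or content controls).
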
[Theory]
[InlineData("1 Run", "Firstname", new[] { "Firstname" }, "Albert")]
[InlineData("2 Runs", "Firstname", new[] { "F", "irstname" }, "Bernie")]
[InlineData("9 Runs", "Firstname", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" }, "Charly")]
public void Replace_PlaceholderInOneOrMoreRuns_SuccessfullyReplaced(
    string example,
    string propName,
    IEnumerable<string> runTexts,
    string replacement)
{
    // Create a test WordprocessingDocument on a MemoryStream.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);

    // Save the Word document before replacing the placeholder.
    // You can use this to inspect the input Word document.
    File.WriteAllBytes($"{example} before Replacing.docx", stream.ToArray());

    // Replace the placeholder identified by propName with the replacement text.
    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
    {
        // Read the root element, a w:document in this case.
        // Note that GetXElement() is a shortcut for GetXDocument().Root.
        // This caches the root element and we can later write it back
        // to the main document part, using the PutXDocument() method.
        XElement document = wordDocument.MainDocumentPart.GetXElement();

        // Specify the parameters of the OpenXmlRegex.Replace() method,
        // noting that the replacement is given as a parameter.
        IEnumerable<XElement> content = document.Descendants(W.p);
        var regex = new Regex(propName);

        // Perform the replacement, thereby modifying the root element.
        OpenXmlRegex.Replace(content, regex, replacement, null);

        // Write the changed root element back to the main document part.
        wordDocument.MainDocumentPart.PutXDocument();
    }

    // Assert that we have done it right.
    AssertReplacementWasSuccessful(stream, replacement);

    // Save the Word document after having replaced the placeholder.
    // You can use this to inspect the output Word document.
    File.WriteAllBytes($"{example} after Replacing.docx", stream.ToArray());
}

private static MemoryStream CreateWordprocessingDocument(IEnumerable<string> runTexts)
{
    var stream = new MemoryStream();
    const WordprocessingDocumentType type = WordprocessingDocumentType.Document;

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Create(stream, type))
    {
        MainDocumentPart mainDocumentPart = wordDocument.AddMainDocumentPart();
        mainDocumentPart.PutXDocument(new XDocument(CreateDocument(runTexts)));
    }

    return stream;
}

private static XElement CreateDocument(IEnumerable<string> runTexts)
{
    // Produce a w:document with a single w:p that contains:
    // (1) one italic run with some lead-in, i.e., "Hello " in this example;
    // (2) one or more bold runs for the placeholder, which might or might not be split;
    // (3) one run with just a space; and
    // (4) one run with a symbol (i.e., a Wingdings smiley face).
    return new XElement(W.document,
        new XAttribute(XNamespace.Xmlns + "w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"),
        new XElement(W.body,
            new XElement(W.p,
                new XElement(W.r,
                    new XElement(W.rPr,
                        new XElement(W.i)),
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        "Hello ")),
                runTexts.Select(rt =>
                    new XElement(W.r,
                        new XElement(W.rPr,
                            new XElement(W.b)),
                        new XElement(W.t, rt))),
                new XElement(W.r,
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        " ")),
                new XElement(W.r,
                    new XElement(W.sym,
                        new XAttribute(W.font, "Wingdings"),
                        new XAttribute(W._char, "F04A"))))));
}

private static void AssertReplacementWasSuccessful(MemoryStream stream, string replacement)
{
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    XElement document = wordDocument.MainDocumentPart.GetXElement();
    XElement paragraph = document.Descendants(W.p).Single();
    List<XElement> runs = paragraph.Elements(W.r).ToList();

    // We have the expected number of runs, i.e., the lead-in, the first name,
    // a space character, and the symbol.
    Assert.Equal(4, runs.Count);

    // We still have the lead-in "Hello " and it is still formatted in italics.
    Assert.True(runs[0].Value == "Hello " && runs[0].Elements(W.rPr).Elements(W.i).Any());

    // We have successfully replaced our "Firstname" placeholder and the
    // concrete first name is formatted in bold, exactly like the placeholder.
    Assert.True(runs[1].Value == replacement && runs[1].Elements(W.rPr).Elements(W.b).Any());

    // We still have the space between the first name and the symbol and it
    // is unformatted.
    Assert.True(runs[2].Value == " " && !runs[2].Elements(W.rPr).Any());

    // Finally, we still have our smiley face symbol run.
    Assert.True(IsSymbolRun(runs[3], "Wingdings", "F04A"));
}

private static bool IsSymbolRun(XElement run, string fontValue, string charValue)
{
    XElement sym = run.Elements(W.sym).FirstOrDefault();
    if (sym == null) return false;

    return (string) sym.Attribute(W.font) == fontValue &&
           (string) sym.Attribute(W._char) == charValue;
}
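
To tie this back to the question, which replaces one placeholder per property of a template object, the following sketch combines the question's reflection loop with the OpenXmlRegex.Replace() call from the unit test above. It is an untested outline under several assumptions: it needs the Open XML SDK and Open-Xml-PowerTools packages, the class name TemplateFiller and the helper GetPlaceholderValues are hypothetical, placeholders are assumed to equal the property names, and only the main document part is processed (not headers, footers, or notes).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class TemplateFiller
{
    // Maps each public instance property name to its value as text (null becomes "").
    public static Dictionary<string, string> GetPlaceholderValues<T>(T templateObject)
        where T : class =>
        templateObject.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.GetIndexParameters().Length == 0)
            .ToDictionary(
                p => p.Name,
                p => p.GetValue(templateObject)?.ToString() ?? string.Empty);

    // Replaces every placeholder in the main document part, editing the file in place.
    public static void ReplacePlaceholders<T>(string templateDocumentPath, T templateObject)
        where T : class
    {
        using WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(templateDocumentPath, true);

        // Same calls as in the unit test: read the cached root element ...
        XElement document = wordDocument.MainDocumentPart.GetXElement();

        foreach (KeyValuePair<string, string> pair in GetPlaceholderValues(templateObject))
        {
            // Query the paragraphs again for each placeholder, because the
            // previous replacement may have changed the markup.
            IEnumerable<XElement> content = document.Descendants(W.p);

            // Escape the property name so it is matched literally.
            var regex = new Regex(Regex.Escape(pair.Key));
            OpenXmlRegex.Replace(content, regex, pair.Value, null);
        }

        // ... and write the changed root element back to the main document part.
        wordDocument.MainDocumentPart.PutXDocument();
    }
}
```

Because the file is opened for editing, you would typically run this on a copy of the template. Note that a plain property name such as "Firstname" also matches inside longer words, so delimited placeholders (for example a hypothetical "{{Firstname}}" convention) may be the safer choice.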

Why Is InnerText Not the Solution?

While it might be tempting to use the InnerText property of the Paragraph class (or other subclasses of the OpenXmlElement class), the problem is that you will be ignoring any non-text (w:t) markup. For example, if your paragraph contains symbols (w:sym elements, e.g., the smiley face used in the example above), those will be lost because they are not considered by the InnerText property. The following unit test demonstrates that:

[Theory]
[InlineData("Hello Firstname ", new[] { "Firstname" })]
[InlineData("Hello Firstname ", new[] { "F", "irstname" })]
[InlineData("Hello Firstname ", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" })]
public void InnerText_ParagraphWithSymbols_SymbolIgnored(string expectedInnerText, IEnumerable<string> runTexts)
{
    // Create Word document with smiley face symbol at the end.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    Document document = wordDocument.MainDocumentPart.Document;
    Paragraph paragraph = document.Descendants<Paragraph>().Single();

    string innerText = paragraph.InnerText;

    // Note that the innerText does not contain the smiley face symbol.
    Assert.Equal(expectedInnerText, innerText);
}

Note that you might not need to consider all of the above in simple use cases. But if you must deal with real-life documents or the markup changes made by Microsoft Word, chances are you can't ignore the complexity. And wait until you need to deal with revision markup ...

As always, the full source code can be found in my CodeSnippets GitHub repository. Look for the OpenXmlRegexTests class.
