Open XML - find and replace multiple placeholders in document template

Problem description

I know there are many posts on SO about this topic, but none seems to treat this particular issue. I'm trying to make a small generic document generator POC. I'm using Open XML.

Here is the code:

    private static void ReplacePlaceholders<T>(string templateDocumentPath, T templateObject)
        where T : class
    {

        using (var templateDocument = WordprocessingDocument.Open(templateDocumentPath, true))
        {
            string templateDocumentText = null;
            using (var streamReader = new StreamReader(templateDocument.MainDocumentPart.GetStream()))
            {
                templateDocumentText = streamReader.ReadToEnd();
            }

            var props = templateObject.GetType().GetProperties();
            foreach (var prop in props)
            {
                var regexText = new Regex($"{prop.Name}");
                templateDocumentText =
                    regexText.Replace(templateDocumentText, prop.GetValue(templateObject).ToString());
            }

            using var streamWriter = new StreamWriter(templateDocument.MainDocumentPart.GetStream(FileMode.Create));
            streamWriter.Write(templateDocumentText);
        }
    }

The code works as intended. The problem is the following:

StreamReader.ReadToEnd() returns my placeholders split between tags, so my Replace method replaces only the words that were not split.

In this case, my code will search for the word "Firstname" but will find "irstname" instead, so it won't replace it.

Is there any way to scan the whole .docx word by word and replace them?

(Edit) A partial solution / workaround I found: I noticed that you have to type the placeholder in the .docx in one go (without re-editing it). For example, if I write "firstname" and then come back and change it to "Firstname", Word splits the word into "F" and "irstname". Without editing, it stays unsplit.

Recommended answer

TLDR

In short, the solution to your problem is to use the OpenXmlRegex utility class of the Open-Xml-PowerTools, as demonstrated in the unit test further below.

Using Open XML, you can represent the same text in multiple ways. If Microsoft Word is involved in creating that Open XML markup, the edits made to produce that text will play an important part. This is because Word keeps track of which edits were made in which editing session. So, for example, the w:p (Paragraph) elements shown in the following extreme scenarios represent precisely the same text. And anything between those two examples is possible, so any real solution must be able to deal with that.

Extreme Scenario 1: Single w:r and w:t Element

The following markup is nice and simple:

<w:p>
  <w:r>
    <w:t>Firstname</w:t>
  </w:r>
</w:p>

Extreme Scenario 2: Single-Character w:r and w:t Elements

While you typically won't find the following markup, it represents the theoretical extreme in which each character has its own w:r and w:t element.

<w:p>
  <w:r><w:t>F</w:t></w:r>
  <w:r><w:t>i</w:t></w:r>
  <w:r><w:t>r</w:t></w:r>
  <w:r><w:t>s</w:t></w:r>
  <w:r><w:t>t</w:t></w:r>
  <w:r><w:t>n</w:t></w:r>
  <w:r><w:t>a</w:t></w:r>
  <w:r><w:t>m</w:t></w:r>
  <w:r><w:t>e</w:t></w:r>
</w:p>
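
To make the point about equivalent representations concrete, here is a minimal sketch that uses only LINQ to XML (no Open XML SDK). The class and method names (RunTextDemo, GetText, CreateParagraph) are made up for this illustration. It shows that a single-run paragraph and a split paragraph carry the same text, while a plain string search over the raw markup of the split paragraph no longer finds the placeholder.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RunTextDemo
{
    // WordprocessingML main namespace (the "w" prefix in the markup above).
    private static readonly XNamespace Wml =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the text of all w:t descendants of a paragraph.
    public static string GetText(XElement paragraph) =>
        string.Concat(paragraph.Descendants(Wml + "t").Select(t => t.Value));

    // Builds a w:p with one w:r/w:t pair per string in runTexts.
    public static XElement CreateParagraph(params string[] runTexts) =>
        new XElement(Wml + "p",
            runTexts.Select(rt => new XElement(Wml + "r", new XElement(Wml + "t", rt))));

    public static void Main()
    {
        XElement single = CreateParagraph("Firstname");
        XElement split = CreateParagraph("F", "irstname");

        // Both paragraphs represent the same text ...
        Console.WriteLine(GetText(single) == GetText(split)); // True

        // ... but the raw markup of the split paragraph reads
        // "...F</w:t></w:r><w:r><w:t>irstname...", so a string search fails.
        string rawMarkup = split.ToString(SaveOptions.DisableFormatting);
        Console.WriteLine(rawMarkup.Contains("Firstname")); // False
    }
}
```

This is why a regular expression applied to the text read from the main document part misses any placeholder that Word has split across runs.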

Why did I use this extreme example if it does not occur in practice, you might ask? The answer is that it plays an essential role in the solution in case you want to roll your own.

To do it right, you must:

  1. transform the runs (w:r) of your paragraph (w:p) into single-character runs (i.e., w:r elements with one single-character w:t or one w:sym each), retaining the run properties (w:rPr);
  2. perform the search-and-replace operation on those single-character runs (using a few other tricks); and
  3. considering the potentially different run properties (w:rPr) of the runs resulting from the search-and-replace action, transform such resulting runs back into the fewest number of "coalesced" runs required to represent the text and its formatting.
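
The three steps above can be sketched with plain LINQ to XML roughly as follows. This is a deliberately naive illustration and not the OpenXmlRegex implementation. The class name NaiveRunReplacer and its helpers are hypothetical. The sketch assumes that each text run holds at most one w:t, copies runs without a w:t (such as w:sym runs) through untouched, and ignores fields, content controls, and revision markup. Non-text runs are represented by U+FFFC in the search text so that character positions line up with run positions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class NaiveRunReplacer
{
    private static readonly XNamespace Wml =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private static bool IsTextRun(XElement run) => run.Element(Wml + "t") != null;

    // Creates a text run with a copy of the given run properties (may be null).
    private static XElement MakeRun(XElement rPr, string text) =>
        new XElement(Wml + "r",
            rPr == null ? null : new XElement(rPr),
            new XElement(Wml + "t",
                new XAttribute(XNamespace.Xml + "space", "preserve"), text));

    // Step 1 helper: one run per character; non-text runs are copied as they are.
    private static IEnumerable<XElement> SplitRun(XElement run)
    {
        XElement t = run.Element(Wml + "t");
        if (t == null) return new[] { new XElement(run) };
        return t.Value.Select(c => MakeRun(run.Element(Wml + "rPr"), c.ToString()));
    }

    public static void Replace(XElement paragraph, Regex regex, string replacement)
    {
        // Step 1: transform the runs into single-character runs.
        List<XElement> runs = paragraph.Elements(Wml + "r").SelectMany(SplitRun).ToList();

        // Step 2: search the concatenated text; character index == run index.
        string text = string.Concat(
            runs.Select(r => IsTextRun(r) ? r.Element(Wml + "t").Value : "\uFFFC"));
        foreach (Match m in regex.Matches(text).Cast<Match>().Reverse())
        {
            if (m.Length == 0) continue;
            // The replacement inherits the formatting of the first matched character.
            XElement rPr = runs[m.Index].Element(Wml + "rPr");
            runs.RemoveRange(m.Index, m.Length);
            runs.Insert(m.Index, MakeRun(rPr, replacement));
        }

        // Step 3: coalesce adjacent text runs that have identical run properties.
        var coalesced = new List<XElement>();
        foreach (XElement run in runs)
        {
            XElement last = coalesced.LastOrDefault();
            bool canMerge = last != null && IsTextRun(last) && IsTextRun(run) &&
                XNode.DeepEquals(last.Element(Wml + "rPr"), run.Element(Wml + "rPr"));
            if (canMerge)
                last.Element(Wml + "t").Value += run.Element(Wml + "t").Value;
            else
                coalesced.Add(run);
        }

        // Swap the paragraph's runs for the coalesced ones (simplification:
        // non-run children such as w:pPr stay in place, new runs are appended).
        paragraph.Elements(Wml + "r").Remove();
        paragraph.Add(coalesced);
    }
}
```

Processing the matches from last to first keeps the indexes of earlier matches valid while runs are removed and inserted. Real documents need considerably more care than this sketch takes.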

When replacing text, you should not lose or alter the formatting of the text that is unaffected by your replacement. You should also not remove unaffected fields or content controls (w:sdt). Ah, and by the way, don't forget revision markup such as w:ins and w:del ...

The good news is that you don't have to roll your own. The OpenXmlRegex utility class of Eric White's Open-Xml-PowerTools implements the above algorithm (and more). I've successfully used it in large-scale RFP and contracting scenarios and also contributed back to it.

In this section, I'm going to demonstrate how to use the Open-Xml-PowerTools to replace the placeholder text "Firstname" (as in the question) with various first names (using "Bernie" in the sample output document).

Let's first look at the following sample document, which is created by the unit test shown a little later. Note that we have formatted runs and a symbol. As in the question, the placeholder "Firstname" is split into two runs, i.e., "F" and "irstname".

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>F</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>irstname</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Desired Output Document

The following is the document resulting from replacing "Firstname" with "Bernie" if you do it right. Note that the formatting is retained and that we did not lose our symbol.

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>Bernie</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Sample Usage

Next, here's a full unit test that demonstrates how to use the OpenXmlRegex.Replace() method, noting that the example only shows one of the multiple overloads. The unit test also demonstrates that this works:

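  • regardless of how the placeholder (e.g., "Firstname") is split across one or more runs;
  • while retaining the placeholder's formatting;
  • without losing the formatting of other runs; and
  • without losing symbols (or any other markup such as fields or content controls).
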
[Theory]
[InlineData("1 Run", "Firstname", new[] { "Firstname" }, "Albert")]
[InlineData("2 Runs", "Firstname", new[] { "F", "irstname" }, "Bernie")]
[InlineData("9 Runs", "Firstname", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" }, "Charly")]
public void Replace_PlaceholderInOneOrMoreRuns_SuccessfullyReplaced(
    string example,
    string propName,
    IEnumerable<string> runTexts,
    string replacement)
{
    // Create a test WordprocessingDocument on a MemoryStream.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);

    // Save the Word document before replacing the placeholder.
    // You can use this to inspect the input Word document.
    File.WriteAllBytes($"{example} before Replacing.docx", stream.ToArray());

    // Replace the placeholder identified by propName with the replacement text.
    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
    {
        // Read the root element, a w:document in this case.
        // Note that GetXElement() is a shortcut for GetXDocument().Root.
        // This caches the root element and we can later write it back
        // to the main document part, using the PutXDocument() method.
        XElement document = wordDocument.MainDocumentPart.GetXElement();

        // Specify the parameters of the OpenXmlRegex.Replace() method,
        // noting that the replacement is given as a parameter.
        IEnumerable<XElement> content = document.Descendants(W.p);
        var regex = new Regex(propName);

        // Perform the replacement, thereby modifying the root element.
        OpenXmlRegex.Replace(content, regex, replacement, null);

        // Write the changed root element back to the main document part.
        wordDocument.MainDocumentPart.PutXDocument();
    }

    // Assert that we have done it right.
    AssertReplacementWasSuccessful(stream, replacement);

    // Save the Word document after having replaced the placeholder.
    // You can use this to inspect the output Word document.
    File.WriteAllBytes($"{example} after Replacing.docx", stream.ToArray());
}

private static MemoryStream CreateWordprocessingDocument(IEnumerable<string> runTexts)
{
    var stream = new MemoryStream();
    const WordprocessingDocumentType type = WordprocessingDocumentType.Document;

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Create(stream, type))
    {
        MainDocumentPart mainDocumentPart = wordDocument.AddMainDocumentPart();
        mainDocumentPart.PutXDocument(new XDocument(CreateDocument(runTexts)));
    }

    return stream;
}

private static XElement CreateDocument(IEnumerable<string> runTexts)
{
    // Produce a w:document with a single w:p that contains:
    // (1) one italic run with some lead-in, i.e., "Hello " in this example;
    // (2) one or more bold runs for the placeholder, which might or might not be split;
    // (3) one run with just a space; and
    // (4) one run with a symbol (i.e., a Wingdings smiley face).
    return new XElement(W.document,
        new XAttribute(XNamespace.Xmlns + "w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"),
        new XElement(W.body,
            new XElement(W.p,
                new XElement(W.r,
                    new XElement(W.rPr,
                        new XElement(W.i)),
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        "Hello ")),
                runTexts.Select(rt =>
                    new XElement(W.r,
                        new XElement(W.rPr,
                            new XElement(W.b)),
                        new XElement(W.t, rt))),
                new XElement(W.r,
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        " ")),
                new XElement(W.r,
                    new XElement(W.sym,
                        new XAttribute(W.font, "Wingdings"),
                        new XAttribute(W._char, "F04A"))))));
}

private static void AssertReplacementWasSuccessful(MemoryStream stream, string replacement)
{
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    XElement document = wordDocument.MainDocumentPart.GetXElement();
    XElement paragraph = document.Descendants(W.p).Single();
    List<XElement> runs = paragraph.Elements(W.r).ToList();

    // We have the expected number of runs, i.e., the lead-in, the first name,
    // a space character, and the symbol.
    Assert.Equal(4, runs.Count);

    // We still have the lead-in "Hello " and it is still formatted in italics.
    Assert.True(runs[0].Value == "Hello " && runs[0].Elements(W.rPr).Elements(W.i).Any());

    // We have successfully replaced our "Firstname" placeholder and the
    // concrete first name is formatted in bold, exactly like the placeholder.
    Assert.True(runs[1].Value == replacement && runs[1].Elements(W.rPr).Elements(W.b).Any());

    // We still have the space between the first name and the symbol and it
    // is unformatted.
    Assert.True(runs[2].Value == " " && !runs[2].Elements(W.rPr).Any());

    // Finally, we still have our smiley face symbol run.
    Assert.True(IsSymbolRun(runs[3], "Wingdings", "F04A"));
}

private static bool IsSymbolRun(XElement run, string fontValue, string charValue)
{
    XElement sym = run.Elements(W.sym).FirstOrDefault();
    if (sym == null) return false;

    return (string) sym.Attribute(W.font) == fontValue &&
           (string) sym.Attribute(W._char) == charValue;
}

Why Is InnerText Not the Solution?

While it might be tempting to use the InnerText property of the Paragraph class (or other subclasses of the OpenXmlElement class), the problem is that you will be ignoring any markup other than text (w:t) elements. For example, if your paragraph contains symbols (w:sym elements, e.g., the smiley face used in the example above), those will be lost because they are not considered by the InnerText property. The following unit test demonstrates that:

[Theory]
[InlineData("Hello Firstname ", new[] { "Firstname" })]
[InlineData("Hello Firstname ", new[] { "F", "irstname" })]
[InlineData("Hello Firstname ", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" })]
public void InnerText_ParagraphWithSymbols_SymbolIgnored(string expectedInnerText, IEnumerable<string> runTexts)
{
    // Create Word document with smiley face symbol at the end.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    Document document = wordDocument.MainDocumentPart.Document;
    Paragraph paragraph = document.Descendants<Paragraph>().Single();

    string innerText = paragraph.InnerText;

    // Note that the innerText does not contain the smiley face symbol.
    Assert.Equal(expectedInnerText, innerText);
}

Note that you might not need to consider all of the above in simple use cases. But if you must deal with real-life documents or the markup changes made by Microsoft Word, chances are you can't ignore the complexity. And wait until you need to deal with revision markup ...

As always, the full source code can be found in my CodeSnippets GitHub repository. Look for the OpenXmlRegexTests class.
