OpenXML tag search

Problem description

I'm writing a .NET application that should read a .docx file roughly 200 pages long (through DocumentFormat.OpenXml 2.5) to find all the occurrences of certain tags that the document should contain. To be clear, I'm not looking for OpenXML tags but for tags that the document writer places in the document as placeholders for values I need to fill in at a second stage. Such tags should be in the following format:

 <!TAG!>

(where TAG can be an arbitrary sequence of characters). As I said, I have to find all the occurrences of such tags and, if possible, locate the 'page' where each occurrence was found. I found some material on the web, but more than once the basic approach was to dump the whole content of the file into a string and then search that string regardless of the .docx encoding. This caused either false positives or no match at all (even though the test .docx file contains several tags); other examples were probably a little beyond my knowledge of OpenXML. The regex pattern to find such tags should be something like this:

<!(.)*?!>

The tag can appear anywhere in the document (inside tables, text, and paragraphs, as well as in headers and footers).

I'm coding in Visual Studio 2013 with .NET 4.5, but I can move to an earlier version if needed. P.S. I would prefer code that does not use the Office Interop API, since the destination platform will not run Office.

The smallest .docx example I can produce stores this in its main document part:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w:rsidR="00CA7780" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:proofErr w:type="gramStart"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRPr="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY2</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="00815E5D" w:rsidRPr="00815E5D">
  <w:pgSz w:w="11906" w:h="16838"/>
  <w:pgMar w:top="1417" w:right="1134" w:bottom="1134" w:left="1134" w:header="708" w:footer="708" w:gutter="0"/>
  <w:cols w:space="708"/>
  <w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>

Best Regards, Mike

Solution

The problem with trying to find tags is that text is not always stored in the underlying XML the way it appears in Word. For example, in your sample XML the <!TAG1!> tag is split across multiple runs like this:

<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
</w:r>

As pointed out in the comments, this is sometimes caused by the spelling and grammar checker, but that is not the only cause. Applying different styles to parts of the tag, for example, could also split it.
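To make the split concrete, here is a minimal sketch that uses only System.Xml.Linq on a trimmed copy of the paragraph from the question. The class and member names (SplitRunDemo, ParagraphXml, PlainText) are made up for illustration, and joining the w:t elements by hand is only similar in spirit to what the SDK does for you.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class SplitRunDemo
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Trimmed copy of the tagged paragraph from the question.
    public const string ParagraphXml =
        "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:r><w:t>&lt;!TAG1</w:t></w:r>" +
        "<w:proofErr w:type=\"gramEnd\"/>" +
        "<w:r><w:t>!&gt;</w:t></w:r>" +
        "</w:p>";

    public static readonly Regex TagRegex = new Regex("<!(.)*?!>");

    // Parses the paragraph and joins the decoded text of every w:t element.
    public static string PlainText(string paragraphXml)
    {
        XElement paragraph = XElement.Parse(paragraphXml);
        return string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
    }
}
```

Here SplitRunDemo.TagRegex.IsMatch(SplitRunDemo.ParagraphXml) is false, because the raw markup contains the entity-encoded &lt;! and run markup sits between the two halves of the tag. SplitRunDemo.PlainText(SplitRunDemo.ParagraphXml) returns <!TAG1!>, which the same regex does match.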

One way of handling this is to find the InnerText of a Paragraph and compare that against your Regex. The InnerText property will return the plain text of the paragraph without any formatting or other XML within the underlying document getting in the way.
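If you only want the first pass from your question (listing the tags without changing anything), a sketch along these lines might do. TagFinder, FindTagsInElement and FindTagsInFile are names I made up, and I moved the capture group to (.*?) so that group 1 holds the whole tag name; treat it as a starting point under those assumptions rather than a finished implementation.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for a read-only "find" pass.
public static class TagFinder
{
    // Group 1 captures the tag name between the <! and !> delimiters.
    private static readonly Regex TagRegex = new Regex("<!(.*?)!>", RegexOptions.Compiled);

    // Returns the name of every tag found in the paragraphs below 'root'.
    public static List<string> FindTagsInElement(OpenXmlElement root)
    {
        List<string> tags = new List<string>();
        foreach (Paragraph paragraph in root.Descendants<Paragraph>())
        {
            // InnerText joins the text of all runs, so split tags are whole again.
            foreach (Match match in TagRegex.Matches(paragraph.InnerText))
            {
                tags.Add(match.Groups[1].Value);
            }
        }
        return tags;
    }

    // Opens the file read-only and scans headers, main document and footers.
    public static List<string> FindTagsInFile(string filename)
    {
        List<string> tags = new List<string>();
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filename, false))
        {
            MainDocumentPart main = doc.MainDocumentPart;
            foreach (HeaderPart headerPart in main.HeaderParts)
            {
                tags.AddRange(FindTagsInElement(headerPart.Header));
            }
            tags.AddRange(FindTagsInElement(main.Document));
            foreach (FooterPart footerPart in main.FooterParts)
            {
                tags.AddRange(FindTagsInElement(footerPart.Footer));
            }
        }
        return tags;
    }
}
```

One caveat: a paragraph nested inside another paragraph (a text box, for instance) would be visited twice, once on its own and once as part of its parent's InnerText, so you may need to de-duplicate in that case.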

Once you have your tags, replacing the text is the next problem. For the reasons above you can't just replace the InnerText with some new text, as it wouldn't be clear which parts of the text belong in which Run. The easiest way around this is to remove any existing Runs and add a new Run with a Text child containing the new text.

The following code shows finding the tags and replacing them immediately rather than using two passes as you suggest in your question. This was just to make the example simpler to be honest. It should show everything you need.

using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

//both methods are meant to live inside a class of your own
private static void ReplaceTags(string filename)
{
    Regex regex = new Regex("<!(.)*?!>", RegexOptions.Compiled);

    //the second argument (true) opens the document for editing
    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filename, true))
    {
        //grab the header parts and replace tags there
        foreach (HeaderPart headerPart in wordDocument.MainDocumentPart.HeaderParts)
        {
            ReplaceParagraphParts(headerPart.Header, regex);
        }
        //now do the document
        ReplaceParagraphParts(wordDocument.MainDocumentPart.Document, regex);
        //now replace the footer parts
        foreach (FooterPart footerPart in wordDocument.MainDocumentPart.FooterParts)
        {
            ReplaceParagraphParts(footerPart.Footer, regex);
        }
    }
}

private static void ReplaceParagraphParts(OpenXmlElement element, Regex regex)
{
    foreach (var paragraph in element.Descendants<Paragraph>())
    {
        //InnerText joins the text of all runs, so a tag split across runs still matches
        string text = paragraph.InnerText;
        if (regex.IsMatch(text))
        {
            //create a new run holding the paragraph text with every tag replaced
            //the text must be read before the child runs are removed otherwise
            //paragraph.InnerText will be empty
            Run newRun = new Run();
            newRun.AppendChild(new Text(regex.Replace(text, "some new value")));
            //remove any child runs
            paragraph.RemoveAllChildren<Run>();
            //add the newly created run
            paragraph.AppendChild(newRun);
        }
    }
}

One downside of the above approach is that any styles you may have had will be lost. These could be copied from the existing Runs, but if there are multiple Runs with differing properties you'll need to work out which ones to copy where. There's nothing to stop you creating multiple Runs in the above code, each with different properties, if that's what is required. Other elements would also be lost (e.g. any symbols), so those would need to be accounted for too.
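As a rough illustration of the simplest case, the sketch below carries over the formatting of the first Run only. TagReplacer and ReplaceKeepingFirstRunStyle are hypothetical names, and choosing the first Run is an assumption on my part; it will not be right for every document.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical variant of the replacement step that keeps one set of run properties.
public static class TagReplacer
{
    public static void ReplaceKeepingFirstRunStyle(Paragraph paragraph, Regex regex, string replacement)
    {
        // Read the text before any runs are removed.
        string text = paragraph.InnerText;
        if (!regex.IsMatch(text))
        {
            return;
        }

        Run newRun = new Run();

        // Clone the first run's properties; an element cannot have two parents.
        Run firstRun = paragraph.Elements<Run>().FirstOrDefault();
        if (firstRun != null && firstRun.RunProperties != null)
        {
            newRun.AppendChild(firstRun.RunProperties.CloneNode(true));
        }

        // The evaluator inserts 'replacement' literally, even if it contains '$'.
        Text newText = new Text(regex.Replace(text, m => replacement));
        // Keep leading and trailing spaces in the new text.
        newText.Space = SpaceProcessingModeValues.Preserve;
        newRun.AppendChild(newText);

        paragraph.RemoveAllChildren<Run>();
        paragraph.AppendChild(newRun);
    }
}
```

You would call this from the foreach loop in ReplaceParagraphParts in place of the block that builds newRun. Anything that is not a direct child Run of the paragraph is left where it is, so the caveat about other elements still applies.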
