将复杂的.docx文件导入为.xml并提取章节 [英] import a complex .docx file as .xml and extract the chapters

查看：198 发布时间：2018/8/2 15:40:02 r xml indexing extract

本文介绍了将复杂的.docx文件导入为.xml并提取章节的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

- 更新 - 也许有人可以假设另一种可能性，所以将 .docx 文件拆分到其章节中，导入 .docx 到R

--update-- maybe someone can assume another possibility so split a .docxdocument into its chapters, importing .docxto R

首先，我要感谢这个很棒的论坛。我为即将发生的问题找到了几种解决方案
但是这次我还没找到任何东西......

first of all, I want to give thanks for this awesome forum. I found several solutions for my upcoming issues. But this time I haven't found anything...

但是，我有一个复杂的 .docx 包含索引的文档，格式为 .xml 。

However, I have a complex .docx document, containing an index, formatted to .xml.

library(XML)
xmlfile <- xmlParse("C:/Users/Documents/stihl.xml", options = HUGE)

topxml <- xmlRoot(xmlfile)

topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names = NULL, node)

以及读取XML文件的其他可能性。
我的 .docx 文档有一个索引，现在我想提取几个索引内容。作为 .docx 示例

And other possibilities to read an XML file. My .docx document has an index and now I want to extract the several index content. As an .docx example

1. Introduction  
   This is an introduction importing XML by R.  
2. UserGuide  
   Userguides are often helpful.  
2.1 Style  
   The style should be always the same.  
2.2 Language  
   I hope my Language is readable, because I'm contacting you from Germany.

因此，收到分隔章节的内容会很好，例如存储在向量。

As a result it would be nice to receive the content of the seperated chapters, for example stored in a vector.

result 
[1]This is an introduction importing XML by R.
[2]Userguides are often helpful.
[3]The style should be always the same.
[4]I hope my Language is readable, because I'm contacting you from Germany.

也许还有其他可能保留结构但我提到了一个包含树结构的XML导入最简单方式。

Maybe there are other possibilities keeping the structure but I mentioned an XML import containing the tree structure as the easiest way.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">

  <pkg:part 
    pkg:name="/_rels/.rels" 
    pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" 
    pkg:padding="512">
    <pkg:xmlData>
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship 
           Id="rId3" 
           Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" 
           Target="docProps/app.xml"/>
          <Relationship 
           Id="rId2" 
           Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" 
           Target="docProps/core.xml"/>
          <Relationship Id="rId1" 
           Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
           Target="word/document.xml"/>
       </Relationships>
    </pkg:xmlData>
  </pkg:part>

  <pkg:part 
   #serveral relationships
  </pkg:part>

  <pkg:part 
    pkg:name="/word/document.xml" 
    pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
     <pkg:xmlData>

      <w:document mc:Ignorable="w14 w15 wp14" 




    xmlns:wpc:http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas
   xmlns:mc:http://schemas.openxmlformats.org/markup-compatibility/2006
   xmlns:o:urn:schemas-microsoft-com:office:office
    xmlns:r:http://schemas.openxmlformats.org/officeDocument/2006/relationships
    xmlns:m:http://schemas.openxmlformats.org/officeDocument/2006/math
    xmlns:v:urn:schemas-microsoft-com:vml
    xmlns:wp14:http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing
    xmlns:wp:http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing
    xmlns:w10:urn:schemas-microsoft-com:office:word
    xmlns:w:http://schemas.openxmlformats.org/wordprocessingml/2006/main
    xmlns:w14:http://schemas.microsoft.com/office/word/2010/wordml
   xmlns:w15:http://schemas.microsoft.com/office/word/2012/wordml
    xmlns:wpg:http://schemas.microsoft.com/office/word/2010/wordprocessingGroup
    xmlns:wpi:http://schemas.microsoft.com/office/word/2010/wordprocessingInk
    xmlns:wne:http://schemas.microsoft.com/office/word/2006/wordml
   xmlns:wps:http://schemas.microsoft.com/office/word/2010/wordprocessingShape

         <w:body>

           <w:p> ...
          </w:p>

          <w:p w14:paraId="5BB64FEF" w14:textId="77777777" w:rsidR="005A3789" w:rsidRDefault="005A3789" w:rsidP="005A3789">
           <w:pPr>
            <w:pStyle w:val="Inhaltsverzeichnisberschrift"/>
           </w:pPr>
           <w:r>
            <w:lastRenderedPageBreak/>
            <w:t>Inhaltsverzeichnis</w:t>
           </w:r>
          </w:p>

'Inhaltsverzeichnis'是我索引的标题。路径是
包 - > 3.part - > xmldata - > document - > body - > p

'Inhaltsverzeichnis' is the titel of my index. The path is package -> 3.part -> xmldata -> document -> body -> p

这些信息存储在这里，例如

The information is stored here for example

<w:p w14:paraId="15ECF978" w14:textId="77777777" w:rsidR="009B5500" w:rsidRDefault="005A3789">
<w:pPr>
<w:pStyle w:val="Verzeichnis1"/>
<w:rPr>
<w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> TOC \o "1-4" \h \z \u 
</w:instrText>
</w:r>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:hyperlink w:anchor="_Toc474825312" w:history="1">
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220"><w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
                  **<w:t>1</w:t>**
</w:r>
<w:r w:rsidR="009B5500"><w:rPr><w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr><w:tab/>
</w:r>
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220">
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
                  **<w:t>Management Summary</w:t>**
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr><w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:instrText xml:space="preserve"> PAGEREF _Toc474825312 \h </w:instrText>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
               **<w:t>6</w:t>**
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:hyperlink>
</w:p>

这是索引的第一个条目， 1。管理摘要6

This is the first entry of the index, 1. Management Summary 6

将复杂的.docx文件导入为.xml并提取章节 [英] import a complex .docx file as .xml and extract the chapters

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

将复杂的.docx文件导入为.xml并提取章节 [英] import a complex .docx file as .xml and extract the chapters

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭