Import a complex .docx file as .xml and extract the chapters

Question

Update: maybe someone can suggest another way to split a .docx document into its chapters when importing the .docx into R.

First of all, thank you for this awesome forum. I have found solutions here for several earlier issues, but this time I haven't found anything.

I have a complex .docx document containing an index (table of contents), which I saved as .xml. So far I have tried:

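library(XML)

# Parse the exported file; HUGE relaxes the parser's hard-coded size limits
xmlfile <- xmlParse("C:/Users/Documents/stihl.xml", options = HUGE)

topxml <- xmlRoot(xmlfile)

# Collect the text value of every grandchild node of the root
topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names = NULL)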

I also tried other ways of reading an XML file. My .docx document has an index, and I now want to extract the content belonging to each index entry. As a .docx example:

1. Introduction  
   This is an introduction importing XML by R.  
2. UserGuide  
   Userguides are often helpful.  
2.1 Style  
   The style should be always the same.  
2.2 Language  
   I hope my Language is readable, because I'm contacting you from Germany. 

As a result, it would be nice to get the content of the separate chapters, for example stored in a vector.

result 
[1]This is an introduction importing XML by R.
[2]Userguides are often helpful.
[3]The style should be always the same.
[4]I hope my Language is readable, because I'm contacting you from Germany.

Maybe there are other ways that keep the structure, but I assumed an XML import, which preserves the tree structure, would be the easiest.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">

  <pkg:part 
    pkg:name="/_rels/.rels" 
    pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" 
    pkg:padding="512">
    <pkg:xmlData>
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship 
           Id="rId3" 
           Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" 
           Target="docProps/app.xml"/>
          <Relationship 
           Id="rId2" 
           Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" 
           Target="docProps/core.xml"/>
          <Relationship Id="rId1" 
           Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
           Target="word/document.xml"/>
       </Relationships>
    </pkg:xmlData>
  </pkg:part>

  <pkg:part 
   #several relationships
  </pkg:part>

  <pkg:part 
    pkg:name="/word/document.xml" 
    pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
     <pkg:xmlData>

      <w:document mc:Ignorable="w14 w15 wp14"
        xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
        xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
        xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
        xmlns:v="urn:schemas-microsoft-com:vml"
        xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
        xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
        xmlns:w10="urn:schemas-microsoft-com:office:word"
        xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
        xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
        xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
        xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
        xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
        xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
        xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">

         <w:body>

           <w:p> ...
          </w:p>

          <w:p w14:paraId="5BB64FEF" w14:textId="77777777" w:rsidR="005A3789" w:rsidRDefault="005A3789" w:rsidP="005A3789">
           <w:pPr>
            <w:pStyle w:val="Inhaltsverzeichnisberschrift"/>
           </w:pPr>
           <w:r>
            <w:lastRenderedPageBreak/>
            <w:t>Inhaltsverzeichnis</w:t>
           </w:r>
          </w:p>

'Inhaltsverzeichnis' is the title of my index. The path is package -> 3rd part -> xmlData -> document -> body -> p.

The information is stored here, for example:

<w:p w14:paraId="15ECF978" w14:textId="77777777" w:rsidR="009B5500" w:rsidRDefault="005A3789">
<w:pPr>
<w:pStyle w:val="Verzeichnis1"/>
<w:rPr>
<w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> TOC \o "1-4" \h \z \u 
</w:instrText>
</w:r>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:hyperlink w:anchor="_Toc474825312" w:history="1">
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220"><w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
                  **<w:t>1</w:t>**
</w:r>
<w:r w:rsidR="009B5500"><w:rPr><w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr><w:tab/>
</w:r>
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220">
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
                  **<w:t>Management Summary</w:t>**
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr><w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:instrText xml:space="preserve"> PAGEREF _Toc474825312 \h </w:instrText>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
               **<w:t>6</w:t>**
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:hyperlink>
</w:p>

This is the first entry of the index: 1. Management Summary 6

Recommended answer

We can use:

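library(xml2)
library(magrittr)  # provides the %>% pipe

x <- read_xml("path/to/file.xml")

# Collect the text of every <w:t> run inside an index hyperlink, then fold the
# flat character vector into rows of three values each
titles <- xml_find_all(x,
               "/pkg:package//pkg:part/pkg:xmlData/w:document/w:body/w:p/w:hyperlink/w:r/w:t") %>%
         xml_text() %>%
         matrix(ncol = 3, byrow = TRUE) %>%
         as.data.frame()

colnames(titles) <- c('numChapter', 'title', 'numPage')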

This retrieves the text inside all the nodes matching that XPath.
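The XPath depends on the pkg and w prefixes being known to xml2. As a hedged sketch, the prefixes can be checked with xml_ns() before querying, or supplied explicitly through the ns argument. The inline XML string is a made-up miniature of the real file, and the names missing_prefixes, ns_fixed and hits are illustrative only.

```r
library(xml2)

# Toy stand-in for the exported Word XML (assumption: same wrapper elements)
x <- read_xml('<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"><pkg:part><pkg:xmlData><w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body><w:p><w:hyperlink><w:r><w:t>1</w:t></w:r></w:hyperlink></w:p></w:body></w:document></pkg:xmlData></pkg:part></pkg:package>')

# Prefixes declared anywhere in the document
ns <- xml_ns(x)
missing_prefixes <- setdiff(c("pkg", "w"), names(ns))
if (length(missing_prefixes) > 0) {
  stop("Undeclared namespace prefixes: ", paste(missing_prefixes, collapse = ", "))
}

# Alternatively, pin the prefix-to-URI mapping explicitly so the query does
# not depend on which prefixes the file happens to use
ns_fixed <- c(
  pkg = "http://schemas.microsoft.com/office/2006/xmlPackage",
  w   = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
)
hits <- xml_find_all(
  x,
  "/pkg:package/pkg:part/pkg:xmlData/w:document/w:body/w:p/w:hyperlink/w:r/w:t",
  ns = ns_fixed
)
xml_text(hits)
```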

Based on your example, that XPath yields (what I suppose are) the numChapter, its title and its numPage for each entry, which is why the matrix is filled with three values per row.
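Filling a three-column matrix assumes every entry has exactly three text runs. If Word splits a title across several runs, the rows shift. A possible variant, sketched here under that assumption, processes one hyperlink at a time and treats the first run as the number, the last as the page, and everything in between as the title. The inline XML, including the second entry with a split title, is invented for illustration.

```r
library(xml2)

# Toy document: the second entry has its title split across two runs
doc <- read_xml('<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage"><pkg:part><pkg:xmlData>
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>
    <w:p><w:hyperlink>
      <w:r><w:t>1</w:t></w:r><w:r><w:t>Management Summary</w:t></w:r><w:r><w:t>6</w:t></w:r>
    </w:hyperlink></w:p>
    <w:p><w:hyperlink>
      <w:r><w:t>2</w:t></w:r><w:r><w:t>User</w:t></w:r><w:r><w:t>Guide</w:t></w:r><w:r><w:t>7</w:t></w:r>
    </w:hyperlink></w:p>
  </w:body></w:document>
</pkg:xmlData></pkg:part></pkg:package>')

# One node per index entry
links <- xml_find_all(doc, "//w:body/w:p/w:hyperlink")

# Build one row per entry instead of reshaping a flat vector
toc <- do.call(rbind, lapply(links, function(h) {
  parts <- xml_text(xml_find_all(h, "./w:r/w:t"))
  n <- length(parts)
  data.frame(
    numChapter = parts[1],
    title      = paste(parts[-c(1, n)], collapse = " "),
    numPage    = parts[n],
    stringsAsFactors = FALSE
  )
}))

print(toc)
```

Joining the middle runs with a space is a guess about how the title was split; an empty separator may fit better if Word broke the title mid-word.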

As noted, this will give an error if the XML is not well formed and/or some namespaces are missing.
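The XPath above only returns the index entries. For the chapter text the question asks for, one possible approach is to walk the body paragraphs, treat paragraphs with a heading style as chapter boundaries, and paste together the text in between. The sketch below assumes the heading style ids look like "berschrift1" or "Heading1"; the excerpt in the question does not show a heading paragraph, so that pattern is a guess and needs checking against the real file. The inline XML mirrors the question's example rather than the real document.

```r
library(xml2)

# Toy body: headings carry an assumed style id, normal paragraphs carry none
doc <- read_xml('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>
  <w:p><w:pPr><w:pStyle w:val="berschrift1"/></w:pPr><w:r><w:t>Introduction</w:t></w:r></w:p>
  <w:p><w:r><w:t>This is an introduction </w:t></w:r><w:r><w:t>importing XML by R.</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="berschrift1"/></w:pPr><w:r><w:t>UserGuide</w:t></w:r></w:p>
  <w:p><w:r><w:t>Userguides are often helpful.</w:t></w:r></w:p>
  <w:p><w:pPr><w:pStyle w:val="berschrift2"/></w:pPr><w:r><w:t>Style</w:t></w:r></w:p>
  <w:p><w:r><w:t>The style should be always the same.</w:t></w:r></w:p>
</w:body></w:document>')

paras <- xml_find_all(doc, "//w:body/w:p")

# Style id of each paragraph (NA when the paragraph has no explicit style)
style <- xml_attr(xml_find_first(paras, "./w:pPr/w:pStyle"), "val")

# Assumed heading pattern; adjust to the style ids used in the real file
is_heading <- !is.na(style) & grepl("^(Heading|berschrift)[0-9]+$", style)

# Full text of each paragraph, joining all of its runs
para_text <- vapply(
  paras,
  function(p) paste(xml_text(xml_find_all(p, ".//w:t")), collapse = ""),
  character(1)
)

# Every heading starts a new chapter; body paragraphs inherit its number
chapter_id <- cumsum(is_heading)
keep <- !is_heading & chapter_id > 0

result <- unname(vapply(
  split(para_text[keep], chapter_id[keep]),
  paste, character(1), collapse = " "
))

print(result)
```

Chapters that contain no body paragraphs are dropped by split(), and paragraphs before the first heading (such as the index itself) are skipped by the chapter_id > 0 filter.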

Hope this helps.
