Finding implicit page breaks in a Word document using XML parsing


Question

I need to extract the first-page content of a Word document. Looking at the Open XML (WordprocessingML) for a document, I can see elements such as <w:lastRenderedPageBreak /> and <w:br w:type="page" />. <w:br w:type="page" /> appears when the user inserts a hard page break, but I don't understand exactly when <w:lastRenderedPageBreak /> appears. It shows up for some implicit page breaks but not all. For example, I typed some text, pressed Enter several times until the cursor moved to the next page, and then pressed Enter a few more times on the new page. This is what I get:

    **DOCUMENT.XML**
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0061403F" w:rsidRDefault="0061403F" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />   <!-- the page break falls here -->
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>

As you can see, even though the cursor moves to the next page as I press Enter, nothing in the document.xml file (in the extracted Word document folder) records this. Can someone help me find the implicit page breaks in a Word document so that I can extract the content of the first page? If there is no way to detect the content of a particular page in the Open XML, how do PDF conversion tools work, given that each Word page is converted to a page in the PDF?

Please do not suggest APIs such as POI, which have no provision for extracting the content of a particular page.

Edit: The reason I need to find the implicit page break is that my task involves extracting the cover image of a Word document. The heuristic I'm following is: "if the first page of the document contains only an image, then it is a cover image; otherwise there is no cover image." So I need to get the content of the first page alone and check whether it contains only an image. How can I do that?

Recommended answer

The short answer is that it's not possible to do what you want by examining the XML. The page rendering engine of Word (or of a PDF converter) is what determines where the pages break. The XML simply describes the content to be "flowed" by the rendering engine.
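When Word has left a marker in the XML, a best-effort check is still possible. The sketch below is not a layout engine: it walks document.xml in document order and stops at the first <w:br w:type="page" /> or <w:lastRenderedPageBreak />, reporting the text and images seen before it. The function names `scan_first_page` and `looks_like_cover_page` are hypothetical, and the sketch assumes images appear as <w:drawing> or legacy <w:pict> elements. If no marker is present, as in the question's sample, it reports that rather than guessing where the page ends.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_BODY = f"{{{W}}}body"
W_BR = f"{{{W}}}br"
W_TYPE = f"{{{W}}}type"
W_LAST_RENDERED = f"{{{W}}}lastRenderedPageBreak"
W_T = f"{{{W}}}t"
W_IMAGE_TAGS = {f"{{{W}}}drawing", f"{{{W}}}pict"}


def _is_break_marker(elem):
    """True for a hard page break or a cached rendered-page-break hint."""
    if elem.tag == W_LAST_RENDERED:
        return True
    return elem.tag == W_BR and elem.get(W_TYPE) == "page"


def scan_first_page(document_xml):
    """Collect text and count images that precede the first break marker.

    Returns a dict with keys 'marker_found', 'text', and 'images'.
    When 'marker_found' is False the XML gives no hint where page one
    ends, so 'text' and 'images' cover the whole body.
    """
    root = ET.fromstring(document_xml)
    body = root.find(W_BODY)
    if body is None:
        raise ValueError("no w:body element found")

    text_parts = []
    images = 0
    # Element.iter() yields elements in document order.
    for elem in body.iter():
        if _is_break_marker(elem):
            return {"marker_found": True,
                    "text": "".join(text_parts),
                    "images": images}
        if elem.tag == W_T and elem.text:
            text_parts.append(elem.text)
        elif elem.tag in W_IMAGE_TAGS:
            images += 1
    return {"marker_found": False,
            "text": "".join(text_parts),
            "images": images}


def looks_like_cover_page(document_xml):
    """Heuristic from the question: first page holds an image and no text."""
    result = scan_first_page(document_xml)
    return (result["marker_found"]
            and result["images"] >= 1
            and result["text"].strip() == "")
```

To use it, read word/document.xml out of the .docx archive (for example with Python's `zipfile` module) and pass the string to `looks_like_cover_page`. A False result with `marker_found` set to False means only that the XML carries no hint; deciding the page boundary in that case needs an actual rendering engine.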
