Reading equations from Word (.docx) to HTML together with their text context using apache poi

Views: 27 Published: 2021/11/12 4:43:35 Tags: java apache-poi position formula equation
Problem description

We are writing Java code that reads a Word document (.docx) into our program using Apache POI.
We get stuck when the document contains formulas and chemical equations.
We managed to read the formulas, but we have no idea how to locate their index within the surrounding string.

INPUT (format is *.docx)

text before formulae **CHEMICAL EQUATION** text after

OUTPUT (format shall be HTML) that we currently produce

text before formulae text after **CHEMICAL EQUATION**

We are unable to fetch the string and reconstruct it in its original form.

Question 

Is there any way to locate the position of images and formulas within the stripped line, so that they can be restored to their original position when the string is reconstructed, instead of being appended at the end of the string?
Solution
If the needed format is HTML, then Word text content together with Office MathML equations can be read in the following way.
In "Reading equations & formula from Word (Docx) to html and save database using java" I provided an example that gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context.
If one wants to get those OMath elements in context with the other elements in the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The example in the Code section below uses the XmlCursor to get text runs together with OMath elements from the paragraph.
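The idea of walking the paragraph XML in document order can be tried out without any Word file. The following is only a sketch under stated assumptions: it uses the plain JDK DOM parser instead of XmlCursor, the paragraph XML in SAMPLE is hand-written rather than taken from a real document, and the class name ParagraphOrderSketch and the [FORMULA:...] placeholder are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ParagraphOrderSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Hand-written stand-in for the XML of one paragraph (an assumption, not from a real file).
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
      + "<w:r><w:t xml:space=\"preserve\">text before </w:t></w:r>"
      + "<m:oMath><m:r><m:t>H2O</m:t></m:r></m:oMath>"
      + "<w:r><w:t xml:space=\"preserve\"> text after</w:t></w:r>"
      + "</w:p>";

    // Parses the paragraph XML and returns text runs and formulas in document order.
    static String walk(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(paragraphXml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse paragraph XML", e);
        }
    }

    // Visits child elements in order; w:r contributes text, m:oMath a placeholder.
    static void collect(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String ns = child.getNamespaceURI();
            String name = child.getLocalName();
            if (W_NS.equals(ns) && "r".equals(name)) {
                out.append(child.getTextContent());
            } else if (M_NS.equals(ns) && "oMath".equals(name)) {
                // In the real code this is where the MathML conversion would go.
                out.append("[FORMULA:").append(child.getTextContent()).append("]");
            } else {
                // Other wrappers (for example m:oMathPara) may contain runs or formulas.
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(walk(SAMPLE));
    }
}
```

Because the formula is emitted at the point where it occurs between the two runs, it stays between "text before" and "text after" instead of ending up at the end of the string.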
The transformation from Office MathML into MathML uses the same XSLT approach as in "Reading equations & formula from Word (Docx) to html and save database using java". That answer also describes where OMML2MML.XSL comes from.
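The mechanics of that XSLT step, plus the later namespace cleanup, can be sketched with the JDK alone. This is an illustration under assumptions, not the real conversion: TINY_XSLT is a hypothetical two-template stylesheet standing in for OMML2MML.XSL, SAMPLE_OMML is a hand-written one-symbol formula, and the class and method names are invented for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class OmmlToHtmlMathSketch {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";
    static final String MML_NS = "http://www.w3.org/1998/Math/MathML";

    // Hypothetical miniature stylesheet; the real OMML2MML.XSL is far larger.
    static final String TINY_XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:m=\"" + M_NS + "\" xmlns:mml=\"" + MML_NS + "\">"
      + "<xsl:template match=\"m:oMath\">"
      + "<mml:math><xsl:apply-templates/></mml:math></xsl:template>"
      + "<xsl:template match=\"m:r\">"
      + "<mml:mi><xsl:value-of select=\"m:t\"/></mml:mi></xsl:template>"
      + "</xsl:stylesheet>";

    // Hand-written Office MathML fragment (an assumption, not from a real file).
    static final String SAMPLE_OMML =
        "<m:oMath xmlns:m=\"" + M_NS + "\"><m:r><m:t>x</m:t></m:r></m:oMath>";

    // Runs the stylesheet, then strips the XML namespace prefixes for use in HTML.
    static String toHtmlMathML(String omml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(TINY_XSLT)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(omml)),
                                  new StreamResult(writer));
            String mathML = writer.toString();
            // Same three cleanup steps as in the answer, using literal replace.
            mathML = mathML.replace("xmlns:m=\"" + M_NS + "\"", "");
            mathML = mathML.replace("xmlns:mml", "xmlns");
            mathML = mathML.replace("mml:", "");
            return mathML;
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtmlMathML(SAMPLE_OMML));
    }
}
```

The sketch uses String.replace rather than replaceAll because the search strings are meant literally; with replaceAll the dots in the namespace URL are treated as regular expression wildcards.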
The file Formula.docx looks like:

Code:
import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;

import org.apache.xmlbeans.XmlCursor;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadTextWithFormulasAsHTML {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet);

 //method for getting MathML from oMath
 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
  //We don't need those since we want to use the MathML in HTML, not in XML.
  //Ideally we would change OMML2MML.XSL so that it does not emit them.
  //But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
  mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replaceAll("xmlns:mml", "xmlns");
  mathML = mathML.replaceAll("mml:", "");

  return mathML;
 }

 //method for getting HTML including MathML from XWPFParagraph
 static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {
  
  StringBuffer textWithFormulas = new StringBuffer();

  //using a cursor to go through the paragraph from top to bottom
  XmlCursor xmlcursor = paragraph.getCTP().newCursor();

  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
     //elements w:r are text runs within the paragraph
     //simply append the text data
     textWithFormulas.append(xmlcursor.getTextValue());
    } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
     //we have oMath
     //append the oMath as MathML
     textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the paragraph
    xmlcursor.push();
    xmlcursor.toParent();
    if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
     break;
    }
    xmlcursor.pop();
   }
  }
  
  return textWithFormulas.toString();
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //using a StringBuffer for appending all the content as HTML
  StringBuffer allHTML = new StringBuffer();

  //loop over all IBodyElements - should be self-explanatory
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    allHTML.append("<p>");
    allHTML.append(getTextAndFormulas(paragraph));
    allHTML.append("</p>");
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement;
    allHTML.append("<table border=1>");
    for (XWPFTableRow row : table.getRows()) {
     allHTML.append("<tr>");
     for (XWPFTableCell cell : row.getTableCells()) {
      allHTML.append("<td>");
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       allHTML.append("<p>");
       allHTML.append(getTextAndFormulas(paragraph));
       allHTML.append("</p>");
      }
      allHTML.append("</td>");
     }
     allHTML.append("</tr>");
    }
    allHTML.append("</table>");
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax to help all browsers interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");

  writer.write(allHTML.toString());

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}
Result:


Just tested this code using Apache POI 5.0.0 and it works. Apache POI 5.0.0 needs poi-ooxml-full-5.0.0.jar. Please read https://poi.apache.org/help/faq.html#faq-N10025 to see which ooxml libraries are needed for which Apache POI version.
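One detail the code above leaves out for simplicity: the text of each run is appended to the HTML as it is, so a run containing characters such as < or & could break the generated markup. A possible addition, shown here only as a sketch with a hypothetical helper name escapeHtml, is to escape the run text before appending it (the MathML produced from the formulas must not be escaped).

```java
class HtmlEscapeSketch {

    // Replaces the characters that have a special meaning in HTML text and attributes.
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("a < b & c"));
    }
}
```

In getTextAndFormulas this would mean appending escapeHtml(xmlcursor.getTextValue()) for w:r elements, while the result of getMathML is still appended unchanged.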
