Reading equations from Word (*.docx) to HTML together with their text context using Apache POI


Problem description


We are writing Java code to read a Word document (.docx) into our program using Apache POI. We get stuck when we encounter formulas and chemical equations inside the document. We managed to read the formulas, but we have no idea how to locate their position within the surrounding text.

INPUT (format is *.docx)

text before formulae **CHEMICAL EQUATION** text after

OUTPUT (format shall be HTML) that our code produces

text before formulae text after **CHEMICAL EQUATION**

We are unable to fetch the string and reconstruct it in its original form.

Question

Is there any way to locate the position of images and formulas within the stripped line, so that they can be restored to their original position when the string is reconstructed, instead of being appended at the end of the string?

Solution

If the needed format is HTML, then Word text content together with Office MathML equations can be read the following way.

In "Reading equations & formula from Word (Docx) to html and save database using java" I have provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context.

If one wants to get those OMath elements in context with the other elements in the paragraph, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.
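To illustrate why walking the paragraph XML in document order keeps formulas in place, here is a minimal sketch that uses only the JDK's DOM parser instead of XmlCursor and Apache POI. The class name, the hand-written sample paragraph XML and the `[FORMULA]` placeholder are assumptions made for this illustration; the real code would append MathML at that point.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ParagraphOrderSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Visits child elements in document order. Text runs (w:r) contribute their text,
    // each m:oMath contributes a placeholder exactly where it occurs.
    static void walk(Node parent, StringBuilder out) {
        for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            String ns = child.getNamespaceURI();
            String name = child.getLocalName();
            if (W_NS.equals(ns) && "r".equals(name)) {
                out.append(child.getTextContent());
            } else if (M_NS.equals(ns) && "oMath".equals(name)) {
                out.append("[FORMULA]"); // hypothetical placeholder for the converted formula
            } else {
                walk(child, out); // descend into wrappers such as m:oMathPara
            }
        }
    }

    // Parses one paragraph given as an XML string and returns text with placeholders.
    static String textWithPlaceholders(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(paragraphXml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse paragraph XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, not taken from a real document
        String xml = "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
                + "<w:r><w:t xml:space=\"preserve\">text before </w:t></w:r>"
                + "<m:oMath><m:r><m:t>H2O</m:t></m:r></m:oMath>"
                + "<w:r><w:t xml:space=\"preserve\"> text after</w:t></w:r>"
                + "</w:p>";
        System.out.println(textWithPlaceholders(xml));
    }
}
```

Because the formula is handled at the moment it is encountered, it ends up between the two text runs rather than at the end of the string. The XmlCursor-based code does the same thing directly on the XmlBeans objects that Apache POI exposes.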

The transformation from Office MathML into MathML is done using the same XSLT approach as in "Reading equations & formula from Word (Docx) to html and save database using java". That answer also describes where the OMML2MML.XSL comes from.
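The XSLT step can be tried in isolation with nothing but the JDK. The sketch below is only an illustration: `TOY_XSL` is a tiny stand-in stylesheet written for this example (it is not OMML2MML.XSL and handles only a single run), and the class and method names are made up. It shows the same sequence of calls: transform a DOM node, omit the XML declaration, then strip the `mml:` prefix so the result can be embedded in HTML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class OmmlTransformSketch {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";
    static final String MML_NS = "http://www.w3.org/1998/Math/MathML";

    // Toy stand-in for OMML2MML.XSL: maps m:oMath to mml:math and each m:r to mml:mi
    static final String TOY_XSL =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
            + " xmlns:m=\"" + M_NS + "\" xmlns:mml=\"" + MML_NS + "\""
            + " exclude-result-prefixes=\"m\">"
            + "<xsl:template match=\"m:oMath\"><mml:math><xsl:apply-templates/></mml:math></xsl:template>"
            + "<xsl:template match=\"m:r\"><mml:mi><xsl:value-of select=\"m:t\"/></mml:mi></xsl:template>"
            + "</xsl:stylesheet>";

    // Transforms an OMML fragment and removes the mml: prefix for use inside HTML.
    static String toHtmlMathML(String ommlXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(ommlXml)));

            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(TOY_XSL)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(writer));

            // Literal replacements: turn the prefixed namespace into the default one
            return writer.toString()
                    .replace("xmlns:mml", "xmlns")
                    .replace("mml:", "");
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String omml = "<m:oMath xmlns:m=\"" + M_NS + "\"><m:r><m:t>x</m:t></m:r></m:oMath>";
        System.out.println(toHtmlMathML(omml));
    }
}
```

With the real OMML2MML.XSL the output is of course much richer; the point here is only the call sequence and the prefix cleanup.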

The file Formula.docx looks like: [image: screenshot of Formula.docx]

Code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;

import org.apache.xmlbeans.XmlCursor;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

/*
needs the full ooxml-schemas-1.4.jar (for apache poi 5.0.0: poi-ooxml-full-5.0.0.jar)
as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadTextWithFormulasAsHTML {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet);

 //method for getting MathML from oMath
 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The native OMML2MML.XSL transforms OMML into MathML as XML having special namespaces.
  //We don't need those since we want to use the MathML in HTML, not in XML.
  //So ideally we should change OMML2MML.XSL to not emit them.
  //But to keep this example as simple as possible, we use literal string replacement to get rid of the XML specifics.
  mathML = mathML.replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replace("xmlns:mml", "xmlns");
  mathML = mathML.replace("mml:", "");

  return mathML;
 }

 //method for getting HTML including MathML from XWPFParagraph
 static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {
  
  StringBuffer textWithFormulas = new StringBuffer();

  //using a cursor to go through the paragraph from start to end
  XmlCursor xmlcursor = paragraph.getCTP().newCursor();

  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
     //elements w:r are text runs within the paragraph
     //simply append the text data
     textWithFormulas.append(xmlcursor.getTextValue());
    } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
     //we have oMath
     //append the oMath as MathML
     textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the paragraph
    xmlcursor.push();
    xmlcursor.toParent();
    if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
     break;
    }
    xmlcursor.pop();
   }
  }
  
  return textWithFormulas.toString();
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //using a StringBuffer for appending all the content as HTML
  StringBuffer allHTML = new StringBuffer();

  //loop over all IBodyElements - should be self-explanatory
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    allHTML.append("<p>");
    allHTML.append(getTextAndFormulas(paragraph));
    allHTML.append("</p>");
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement;
    allHTML.append("<table border=1>");
    for (XWPFTableRow row : table.getRows()) {
     allHTML.append("<tr>");
     for (XWPFTableCell cell : row.getTableCells()) {
      allHTML.append("<td>");
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       allHTML.append("<p>");
       allHTML.append(getTextAndFormulas(paragraph));
       allHTML.append("</p>");
      }
      allHTML.append("</td>");
     }
     allHTML.append("</tr>");
    }
    allHTML.append("</table>");
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");

  writer.write(allHTML.toString());

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}
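One thing the example above does not do is escape the run text before appending it to the HTML, so a literal `<` or `&` in the Word text would end up unescaped in result.html. A minimal sketch of a helper that could be applied to the value returned by xmlcursor.getTextValue() follows; the class and method names are hypothetical, and a library routine could be used instead.

```java
public class HtmlEscapeSketch {

    // Replaces the characters that have special meaning in HTML text content.
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: if a &lt; b &amp; b &gt; c
        System.out.println(escapeHtml("if a < b & b > c"));
    }
}
```

The MathML produced by the XSLT step is already markup and must not be passed through this helper; only the plain text of the w:r runs would be.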

Result: [image: screenshot of result.html opened in a browser]


Just tested this code using Apache POI 5.0.0 and it works. You need poi-ooxml-full-5.0.0.jar for Apache POI 5.0.0. Please read https://poi.apache.org/help/faq.html#faq-N10025 to find out which ooxml libraries are needed for which Apache POI version.
