Reading equations from Word (*.docx) to HTML together with their text context using Apache POI


Problem description

We are writing Java code that reads a Word document (.docx) into our program using Apache POI. We get stuck when we encounter formulas and chemical equations inside the document. We have managed to read the formulas, but we have no idea how to locate their index within the string concerned.

INPUT (format is *.docx)

text before formulae **CHEMICAL EQUATION** text after

OUTPUT (format shall be HTML) we designed

text before formulae text after **CHEMICAL EQUATION**

We are unable to fetch the string and reconstruct it in its original form.

Question

Is there any way to locate the position of the image and formulae within the stripped line, so that they can be restored to their original position when the string is reconstructed, rather than being appended at the end of the string?

Recommended answer

If the needed format is HTML, then Word text content together with Office MathML equations can be read in the following way.

In "Reading equations & formula from Word (Docx) to html and save database using java" I have provided an example which gets all Office MathML equations out of a Word document into HTML. It uses paragraph.getCTP().getOMathList() and paragraph.getCTP().getOMathParaList() to get the OMath elements from the paragraph. This takes the OMath elements out of their text context, so their position relative to the surrounding text is lost.

If one wants to get those OMath elements in context with the other elements in the paragraphs, then an org.apache.xmlbeans.XmlCursor is needed to loop over all the different XML elements in the paragraph. The following example uses the XmlCursor to get text runs together with OMath elements from the paragraph.
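To see why walking the paragraph's XML in document order keeps the equation in place, here is a minimal sketch that uses only the JDK's DOM parser rather than POI or XmlCursor. The paragraph XML is hand-written and hypothetical (a real w:p carries many more elements), and the class name, the method name and the [MATH:...] placeholder are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DocumentOrderSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Hypothetical, simplified paragraph: a text run, an equation, another text run.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\" xmlns:m=\"" + M_NS + "\">"
      + "<w:r><w:t xml:space=\"preserve\">text before </w:t></w:r>"
      + "<m:oMath><m:r><m:t>H2O</m:t></m:r></m:oMath>"
      + "<w:r><w:t xml:space=\"preserve\"> text after</w:t></w:r>"
      + "</w:p>";

    // Visits the direct children of the paragraph in document order, so the
    // equation ends up between the two text runs instead of at the end.
    static String readInOrder(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(paragraphXml)));

        StringBuilder out = new StringBuilder();
        for (Node child = doc.getDocumentElement().getFirstChild();
             child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            if (W_NS.equals(child.getNamespaceURI()) && "r".equals(child.getLocalName())) {
                // text run: append its text
                out.append(child.getTextContent());
            } else if (M_NS.equals(child.getNamespaceURI()) && "oMath".equals(child.getLocalName())) {
                // equation: append a placeholder where the MathML would go
                out.append("[MATH:").append(child.getTextContent()).append("]");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readInOrder(SAMPLE));
    }
}
```

The sketch only looks at direct children of the paragraph. The XmlCursor loop in the answer's code visits every token instead, which is why it also finds oMath elements that are nested deeper, for example inside oMathPara.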

The transformation from Office MathML into MathML is done using the same XSLT approach as in "Reading equations & formula from Word (Docx) to html and save database using java". That answer also describes where the OMML2MML.XSL comes from.
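The mechanics of applying a stylesheet with the JDK's javax.xml.transform API can be sketched as follows. The stylesheet here is a toy stand-in written for this illustration, not OMML2MML.XSL, and it handles only a single trivial case; the class and method names are likewise assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    // Toy stylesheet (an assumption for illustration, NOT OMML2MML.XSL):
    // it wraps the text content of an m:oMath root element into <math><mi>.
    static final String TOY_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:m=\"" + M_NS + "\" exclude-result-prefixes=\"m\">"
      + "<xsl:template match=\"/m:oMath\">"
      + "<math><mi><xsl:value-of select=\".\"/></mi></math>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical minimal equation containing only the variable x.
    static final String SAMPLE_OMML =
        "<m:oMath xmlns:m=\"" + M_NS + "\"><m:r><m:t>x</m:t></m:r></m:oMath>";

    // Applies the given stylesheet to the given XML and returns the result
    // as a string without the leading XML declaration.
    static String transform(String xml, String xsl) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE_OMML, TOY_XSL));
    }
}
```

The answer's code follows the same pattern, but reads the stylesheet from the file OMML2MML.XSL and feeds the transformer a DOMSource built from the oMath element's DOM node.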

The file Formula.docx looks like this (screenshot not preserved):

Code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.apache.xmlbeans.XmlCursor;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.4.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadTextWithFormulasAsHTML {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet);

 //method for getting MathML from oMath
 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
  //We don't need these since we want to use the MathML in HTML, not in XML.
  //So ideally we should change the OMML2MML.XSL to not do so.
  //But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
  mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replaceAll("xmlns:mml", "xmlns");
  mathML = mathML.replaceAll("mml:", "");

  return mathML;
 }

 //method for getting HTML including MathML from XWPFParagraph
 static String getTextAndFormulas(XWPFParagraph paragraph) throws Exception {

  StringBuilder textWithFormulas = new StringBuilder();

  //using a cursor to walk through the paragraph from start to end
  XmlCursor xmlcursor = paragraph.getCTP().newCursor();

  while (xmlcursor.hasNextToken()) {
   XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
   if (tokentype.isStart()) {
    if (xmlcursor.getName().getPrefix().equalsIgnoreCase("w") && xmlcursor.getName().getLocalPart().equalsIgnoreCase("r")) {
     //elements w:r are text runs within the paragraph
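     //append the text data, escaped so that &, < and > in the text cannot break the HTML
     textWithFormulas.append(xmlcursor.getTextValue().replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"));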
    } else if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("oMath")) {
     //we have oMath
     //append the oMath as MathML
     textWithFormulas.append(getMathML((CTOMath)xmlcursor.getObject()));
    } 
   } else if (tokentype.isEnd()) {
    //we have to check whether we are at the end of the paragraph
    xmlcursor.push();
    xmlcursor.toParent();
    if (xmlcursor.getName().getLocalPart().equalsIgnoreCase("p")) {
     break;
    }
    xmlcursor.pop();
   }
  }

  return textWithFormulas.toString();
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //using a StringBuilder for appending all the content as HTML
  StringBuilder allHTML = new StringBuilder();

  //loop over all IBodyElements (paragraphs and tables) in document order
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    allHTML.append("<p>");
    allHTML.append(getTextAndFormulas(paragraph));
    allHTML.append("</p>");
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement;
    allHTML.append("<table border=1>");
    for (XWPFTableRow row : table.getRows()) {
     allHTML.append("<tr>");
     for (XWPFTableCell cell : row.getTableCells()) {
      allHTML.append("<td>");
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       allHTML.append("<p>");
       allHTML.append(getTextAndFormulas(paragraph));
       allHTML.append("</p>");
      }
      allHTML.append("</td>");
     }
     allHTML.append("</tr>");
    }
    allHTML.append("</table>");
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");

  writer.write(allHTML.toString());

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}
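The namespace cleanup inside getMathML above can be tried in isolation. The sketch below applies the same three replacements to a hand-written sample string; the sample is an assumption about what the stylesheet output looks like, not captured output, and the class and method names are hypothetical. It uses String.replace, which matches literally, instead of the regex-based replaceAll.

```java
public class NamespaceCleanupSketch {

    // Hypothetical stylesheet output: MathML with an mml: prefix plus a
    // leftover declaration of the Office math namespace.
    static final String SAMPLE =
        "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\""
      + " xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\">"
      + "<mml:mi>x</mml:mi></mml:math>";

    // Same three steps as in getMathML, with literal matching.
    static String cleanup(String mathML) {
        // 1. drop the Office math namespace declaration
        mathML = mathML.replace(
            "xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        // 2. turn the prefixed MathML namespace into the default namespace
        mathML = mathML.replace("xmlns:mml", "xmlns");
        // 3. strip the mml: prefix from all element names
        mathML = mathML.replace("mml:", "");
        return mathML;
    }

    public static void main(String[] args) {
        System.out.println(cleanup(SAMPLE));
    }
}
```

String replacement like this would also alter equation text that happens to contain "mml:", so changing the stylesheet itself, as the code comment suggests, is the more robust route.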

Result (screenshot not preserved):
