Reading equations and formulas from Word (docx) to HTML and saving them to a database using Java


Problem description


I have a Word (docx) file that contains equations, as shown in the image below.

I want to read the data from the docx file and save it to my database, so that I can later fetch it from the database and show it on my HTML page. I used Apache POI to read the data from the docx file, but it cannot read the equations. Please help me!

Solution

Word *.docx files are ZIP archives containing XML files in the Office Open XML format. The formulas contained in Word *.docx documents are stored as Office Math Markup Language (OMML).

Unfortunately, this XML format is not well known outside Microsoft Office, so it is not directly usable in HTML, for example. Fortunately, it is XML, and as such it can be transformed with XSLT. So we can transform the OMML into MathML, for example, which is usable in a much wider range of use cases.
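To make the XSLT step concrete, here is a minimal sketch of how a stylesheet is applied with the JDK's javax.xml.transform API. The stylesheet in it is a toy written for illustration (it turns <item> elements into <li> elements) and is an assumption of this sketch, not OMML2MML.XSL; the class and method names are hypothetical as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltSketch {

    // Toy stylesheet (an assumption for illustration, NOT OMML2MML.XSL):
    // it rewrites <list><item>..</item></list> into <ul><li>..</li></ul>.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\"/>"
      + "<xsl:template match=\"/list\"><ul><xsl:apply-templates select=\"item\"/></ul></xsl:template>"
      + "<xsl:template match=\"item\"><li><xsl:value-of select=\".\"/></li></xsl:template>"
      + "</xsl:stylesheet>";

    // Compiles the stylesheet once; the Templates object can be reused for many documents.
    static Templates compile(String xsl) throws TransformerException {
        return TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(xsl)));
    }

    // Applies the compiled stylesheet to one XML string and returns the result as a string.
    static String apply(Templates templates, String xml) throws TransformerException {
        Transformer transformer = templates.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        Templates templates = compile(XSL);
        System.out.println(apply(templates, "<list><item>a</item><item>b</item></list>"));
    }
}
```

Transforming OMML works the same way in principle: the stylesheet source would be the OMML2MML.XSL file and the input would be the DOM node of a formula instead of a string.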

A transformation via XSLT is driven mainly by an XSL definition of the transformation. Unfortunately, creating such a stylesheet is not easy either. Fortunately, Microsoft has already done that: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory in %ProgramFiles%\. If you don't find it there, search the web for it.

Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.

My example stores the formulas it finds as MathML in an ArrayList of strings. You should be able to store these strings in your database as well.
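Storing the strings could look roughly like the following plain JDBC sketch. The table name formulas, its columns, and the class FormulaStore are hypothetical and not part of the original answer; the Connection is assumed to come from whatever JDBC driver your database uses.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class FormulaStore {

    // Hypothetical schema: a table "formulas" with an integer "id" column and a
    // text column "mathml" (the exact column type, e.g. TEXT or CLOB, depends on the database).
    static final String INSERT_SQL = "INSERT INTO formulas (id, mathml) VALUES (?, ?)";
    static final String SELECT_SQL = "SELECT mathml FROM formulas ORDER BY id";

    // Saves every MathML string as one row, numbering the rows from 1 in list order.
    static void saveAll(Connection connection, List<String> mathMLList) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            int id = 1;
            for (String mathML : mathMLList) {
                statement.setInt(1, id++);
                statement.setString(2, mathML);
                statement.addBatch();
            }
            statement.executeBatch();
        }
    }

    // Reads the MathML strings back in the order in which they were saved.
    static List<String> loadAll(Connection connection) throws SQLException {
        List<String> result = new ArrayList<>();
        try (PreparedStatement statement = connection.prepareStatement(SELECT_SQL);
             ResultSet rows = statement.executeQuery()) {
            while (rows.next()) {
                result.add(rows.getString(1));
            }
        }
        return result;
    }
}
```

With such a helper, FormulaStore.saveAll(connection, mathMLList) would be called once the list is filled, and the strings returned by FormulaStore.loadAll(connection) could be written into the HTML page in the same way the example writes mathMLList.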

The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath which is not shipped with the smaller poi-ooxml-schemas jar.

Word document:

Java code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadFormulas {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet); 

 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The stock OMML2MML.XSL emits MathML as XML with namespace prefixes (mml:).
  //We don't need the prefixes because we want to embed the MathML in HTML, not in XML.
  //Ideally we would change OMML2MML.XSL so that it doesn't emit them.
  //To keep this example as simple as possible, we strip them with plain string replacements instead.
  mathML = mathML.replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replace("xmlns:mml", "xmlns");
  mathML = mathML.replace("mml:", "");

  return mathML;
 }

 public static void main(String[] args) throws Exception {
    
  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //storing the found MathML in an ArrayList of strings
  List<String> mathMLList = new ArrayList<String>();

  //getting the formulas out of all body elements
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
     mathMLList.add(getMathML(ctomath));
    }
    for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
     for (CTOMath ctomath : ctomathpara.getOMathList()) {
      mathMLList.add(getMathML(ctomath));
     }
    }
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement; 
    for (XWPFTableRow row : table.getRows()) {
     for (XWPFTableCell cell : row.getTableCells()) {
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
        mathMLList.add(getMathML(ctomath));
       }
       for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
        for (CTOMath ctomath : ctomathpara.getOMathList()) {
         mathMLList.add(getMathML(ctomath));
        }
       }
      }
     }
    }
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");
  writer.write("<p>The following formulas were found in the Word document:</p>");

  int i = 1;
  for (String mathML : mathMLList) {
   writer.write("<p>Formula " + i++ + ":</p>");
   writer.write(mathML);
   writer.write("<p/>");
  }

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}
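To see what the string replacements in getMathML do, the following small sketch applies the same three replacements to a hand-written sample. The sample string is an assumption about the general shape of the transformer output (prefixed MathML elements plus a leftover OMML namespace declaration), not actual output captured from OMML2MML.XSL, and the class name is hypothetical.

```java
final class MathMlCleanupSketch {

    // Same three replacements as in getMathML, applied as literal (non-regex) replacements.
    static String stripPrefixes(String mathML) {
        return mathML
            .replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "")
            .replace("xmlns:mml", "xmlns")
            .replace("mml:", "");
    }

    public static void main(String[] args) {
        // Hand-written sample (an assumption), shaped like namespace-prefixed MathML.
        String sample = "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\""
            + " xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\">"
            + "<mml:mi>x</mml:mi></mml:math>";
        System.out.println(stripPrefixes(sample));
    }
}
```

Because this is plain text replacement, it would also alter formula text that happens to contain the characters mml:, which is one reason the code comments call changing the stylesheet the cleaner option.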

Result:


I just tested this code with Apache POI 5.0.0 and it works. For Apache POI 5.0.0 you need poi-ooxml-full-5.0.0.jar. Please read https://poi.apache.org/help/faq.html#faq-N10025 to see which ooxml libraries are needed for which Apache POI version.
