How to convert HTML to well-formed DOCX with styling attributes intact


Question

I am trying to convert an HTML5 file to DOCX using docx4j. The HTML contains both Arabic and English data, and I have set styling on its elements. The HTML renders correctly in Chrome, but when I convert it to DOCX with docx4j the Arabic text formatting is lost. In MS Word the Arabic text is reported as having the bold style set, yet it is not displayed in bold. RTL direction is lost as well: tables are flipped from RTL to LTR. As a workaround I used a BufferedWriter to generate a .doc file, which matches my HTML including its styling attributes, but the Base64 image embedded in the HTML does not appear in the .doc file. Hence the need to convert to .docx. My requirement is an editable document generated from my HTML. Please guide me, as I have been scratching my head over this, and none of the sample code I have found works either.

Here is the code I am using to convert HTML to DOCX:

public boolean convertHTMLToDocx(String inputFilePath, String outputFilePath, boolean headerFlag,
        boolean footerFlag,String orientation, String logoPath, String margin, JSONObject json,boolean isArabic) {
    boolean conversionFlag;
    boolean orientationFlag = false;
    try {
        if(!orientation.equalsIgnoreCase("Y")){
            orientationFlag = true;
        }
        String stringFromFile = FileUtils.readFileToString(new File(inputFilePath), "UTF-8");
        String unescaped = stringFromFile;
        WordprocessingMLPackage wordMLPackage  = WordprocessingMLPackage.createPackage();
        NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
        wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
        ndp.unmarshalDefaultNumbering();

        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.Bidi.Heuristic", true);
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.Element.Heading.MapToStyle", true);
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.serif", "Frutiger LT Arabic 45 Light");
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.sans-serif", "Frutiger LT Arabic 45 Light");
        ImportXHTMLProperties.setProperty("docx4j-ImportXHTML.fonts.default.monospace", "Frutiger LT Arabic 45 Light");

        XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
        xHTMLImporter.setHyperlinkStyle("Hyperlink");
        xHTMLImporter.setParagraphFormatting(FormattingOption.CLASS_PLUS_OTHER);
        xHTMLImporter.setTableFormatting(FormattingOption.CLASS_PLUS_OTHER);
        xHTMLImporter.setRunFormatting(FormattingOption.CLASS_PLUS_OTHER);

        wordMLPackage.getMainDocumentPart().getContent().addAll(xHTMLImporter.convert(unescaped, ""));

        XmlUtils.marshaltoString(wordMLPackage.getMainDocumentPart().getJaxbElement(),true,true);
        File output = new File(outputFilePath);

        wordMLPackage.save(output);

        Console.log("file path where it is stored is" + " " + output.getAbsolutePath());
        if (headerFlag || footerFlag) {
            File file = new File(outputFilePath);
            InputStream in = new FileInputStream(file);

            wordMLPackage = WordprocessingMLPackage.load(in);
            if (headerFlag) {
                // set Header 
            }
            if (footerFlag) {
                // set Footer
            }

            wordMLPackage.save(file);
            Console.log("Finished editing the word document");
        }
        conversionFlag = true;
    } catch (InvalidFormatException e) {
        Error.log("Invalid format found:-" + getStackTrace(e));
        conversionFlag = false;
    } catch (Exception e) {
        Error.log("Error while converting:-" + getStackTrace(e));
        conversionFlag = false;
    }

    return conversionFlag;
}

Recommended answer

Here is how I approached it. It is not the best approach, but I have seen it implemented in organizations, where WAR files are deployed on application servers to serve static and dynamic content over HTTP.

So, instead of producing a .docx, I wrote the HTML as a plain byte array to a file with a .doc extension. That way the final Word document looks exactly like the HTML. The only issue I faced was that the binary (Base64) images were not displayed; only a box appeared in place of each image.

So, I wrote two files:

1st- Read all the binary image tags from the HTML file and decode the images with a Base64 decoder. Save each decoded image to disk on the server host, build the path to that file, and replace the src attribute of every such img tag in the HTML with that location. (The new location is prefixed with http://{remote_server}:{remote_port}/{war_deployment_descriptor}/images/ and followed by <disk_path_where_image_was_stored>.)

2nd- Create a simple servlet in the WAR file deployed on the server. It listens for GET requests on /images and, on receiving a GET request with a path name, returns the image on the OutputStream.
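The first step above can be sketched in isolation. This is a minimal, standard-library-only illustration and not the listing used in this answer: it swaps in java.util.Base64 for the Commons Codec decoder, and the class name DataUriImage, the method names and the storeAndGetUrl callback are all hypothetical names chosen for the example.

```java
import java.util.Base64;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: decode base64 data URIs found in img tags and swap them for URLs. */
final class DataUriImage {
    // data:<type>/<subtype>[;parameters];base64,<payload>
    private static final Pattern DATA_URI = Pattern.compile(
            "^data:[a-zA-Z0-9.+-]+/([a-zA-Z0-9.+-]+)(?:;[^,;]+)*;base64,(.*)$", Pattern.DOTALL);
    // group 1: everything up to the opening quote, group 2: the quote, group 3: the src value
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])(.*?)\\2",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    final String extension; // for example ".png"
    final byte[] bytes;     // decoded image data

    private DataUriImage(String extension, byte[] bytes) {
        this.extension = extension;
        this.bytes = bytes;
    }

    /** Returns null when src is not a base64 data URI (for example an ordinary URL). */
    static DataUriImage parse(String src) {
        Matcher m = DATA_URI.matcher(src.trim());
        if (!m.matches()) {
            return null;
        }
        String subtype = m.group(1).toLowerCase();
        int plus = subtype.indexOf('+');
        if (plus > 0) {
            subtype = subtype.substring(0, plus); // svg+xml -> svg
        }
        // The MIME decoder tolerates line breaks inside the payload
        byte[] decoded = Base64.getMimeDecoder().decode(m.group(2));
        return new DataUriImage("." + subtype, decoded);
    }

    /** Replaces every data-URI src with the URL returned by the (hypothetical) callback. */
    static String replaceDataUris(String html, Function<DataUriImage, String> storeAndGetUrl) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            DataUriImage image = parse(m.group(3));
            String replacement = (image == null)
                    ? m.group() // leave ordinary URLs untouched
                    : m.group(1) + m.group(2) + storeAndGetUrl.apply(image) + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In a real setup the callback would write image.bytes to disk and return the URL under which the servlet from the second step serves that file. A regular expression is only a rough way to find img tags; an HTML parser would be more robust for markup you do not control.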

Voila, the images started appearing.

Disclaimer- These images will, however, not be visible outside your network. I was lucky that the documents were used strictly within the customer's network. To make them available outside it, you can ask your IT team to expose the path that serves the images, either on the open network or on whichever network needs access.

Edit - You can create a new WAR file for hosting these images, or use the one that generates them.

My experience- For English documents, go for .docx conversion using docx4j. For Arabic, Hebrew or other RTL languages, go for the .doc conversion described above. All such .doc documents can then easily be converted to .docx from MS Word.
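For readers who still want the direct .docx route for Arabic, one possible explanation of the "bold is set but not shown" symptom is worth checking. It is an assumption, not a confirmed diagnosis of the question's output. WordprocessingML keeps separate run properties for complex-script text: w:b applies to non-complex-script characters, while w:bCs applies to complex-script characters such as Arabic and Hebrew. A run that carries only w:b would then show Latin text in bold but not Arabic. The sketch below uses only the JDK's XML APIs on a document.xml string; the class and method names are hypothetical, and wiring it into a docx4j package is left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: add w:bCs next to every w:b so bold also applies to complex-script text. */
final class ComplexScriptBold {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String addComplexScriptBold(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)));

        // The list is live, but only w:bCs elements are added, so the set of w:b stays stable
        NodeList bolds = doc.getElementsByTagNameNS(W, "b");
        for (int i = 0; i < bolds.getLength(); i++) {
            Element b = (Element) bolds.item(i);
            Element parent = (Element) b.getParentNode();
            if (parent.getElementsByTagNameNS(W, "bCs").getLength() > 0) {
                continue; // already present, keep the operation idempotent
            }
            String prefix = b.getPrefix();
            String qualifier = (prefix == null) ? "" : prefix + ":";
            Element bCs = doc.createElementNS(W, qualifier + "bCs");
            if (b.hasAttributeNS(W, "val")) {
                // mirror an explicit on/off value such as w:val="0"
                bCs.setAttributeNS(W, qualifier + "val", b.getAttributeNS(W, "val"));
            }
            // the schema orders bCs directly after b inside w:rPr
            parent.insertBefore(bCs, b.getNextSibling());
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Whether this resolves the rendering depends on what the converter actually emitted, so it is worth inspecting the generated document.xml first. The right-to-left issues described in the question involve other properties and are not addressed by this sketch.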

Listing the two files; please change them as per your needs:

File1.java

        // Needs java.io.*, java.nio.charset.*, java.util.regex.* and
        // org.apache.commons.codec.binary.Base64 (Apache Commons Codec).
        public static void writeHTMLDatatoDoc(String content, String inputHTMLFile,String outputDocFile,String uniqueName) throws Exception {
            String baseTag = getRemoteServerURL()+"/{war_deployment_descriptor}/images?image=";
            String tag = "Image_";
            String pathOnServer = getDiskPath() + File.separator + "TemplateGeneration"
                    + File.separator + "generatedTemplates" + File.separator + uniqueName + File.separator + "images" + File.separator;

            // Make sure the target directory exists before listing it or writing into it
            new File(pathOnServer).mkdirs();

            // Group 1 captures the value of the src attribute of each img tag
            Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
            Matcher m = p.matcher(content);
            while (m.find()) {
                // srcTag holds the whole attribute value, e.g. data:image/png;base64,.........
                // It is replaced later with the URL of the file saved on disk
                String srcTag = m.group(1);

                // Only embedded (binary) images are rewritten; ordinary URLs are left alone.
                // The check is made per image, so one Base64 image does not affect the next tag.
                if (!srcTag.contains("base64")) {
                    continue;
                }

                // Derive the image extension from the MIME type, e.g. image/jpeg -> .jpeg
                String ext = ".png";
                String mimeType = extractMimeType(srcTag);
                if (mimeType.lastIndexOf('/') != -1)
                    ext = "." + mimeType.substring(mimeType.lastIndexOf('/') + 1);

                // Read the files already created for the different documents of this unique entity.
                // The location contains all image files as Image_{i}.{image_extension}.
                // Take the highest counter in use and add one to name the next image.
                int i = findiDynamicallyFromFilesCreatedForWI(pathOnServer) + 1;

                // Remove "data:image/png;base64," from srcTag so only the encoded image data is left,
                // then decode it with the Base64 decoder.
                byte[] imageByteArray = decodeImage(srcTag.substring(srcTag.indexOf(",") + 1));

                // Construct the replacement URL
                String replacement = (baseTag+pathOnServer+tag+i+ext).replace("\\", "/");

                // Write the image into the local directory on the server
                try (FileOutputStream imageOutFile = new FileOutputStream(pathOnServer+tag+i+ext)) {
                    imageOutFile.write(imageByteArray);
                }
                content = content.replace(srcTag, replacement);
            }

            // Rewrite the HTML file
            writeHTMLData(content,inputHTMLFile);

            // Write the same content to the .doc file
            writeHTMLData(content,outputDocFile);
        }
    
        public static int findiDynamicallyFromFilesCreatedForWI(String pathOnServer) {
            int highestFileCount = 0;
            // list() returns null when the directory does not exist or cannot be read
            String[] dirListing = new File(pathOnServer).list();
            if (dirListing == null) {
                return highestFileCount;
            }
            // Compare the counters numerically: a plain string sort ranks Image_10 below Image_2
            for (String name : dirListing) {
                int underscore = name.indexOf('_');
                int dot = name.lastIndexOf('.');
                if (underscore == -1 || dot <= underscore) {
                    continue; // not named Image_{i}.{image_extension}
                }
                try {
                    int count = Integer.parseInt(name.substring(underscore + 1, dot));
                    highestFileCount = Math.max(highestFileCount, count);
                } catch (NumberFormatException e) {
                    // ignore files whose counter is not a number
                }
            }
            return highestFileCount;
        }
    
        private static String extractMimeType(final String encoded) {
            final Pattern mime = Pattern.compile("^data:([a-zA-Z0-9]+/[a-zA-Z0-9]+).*,.*");
            final Matcher matcher = mime.matcher(encoded);
            if (!matcher.find())
                return "";
            return matcher.group(1).toLowerCase();
        }
    
        private static void writeHTMLData(String inputData, String outputFilepath) {
            // try-with-resources closes the writer even when write() fails
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(new File(outputFilepath)), StandardCharsets.UTF_8))) {
                writer.write(inputData);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    
        public static byte[] decodeImage(String imageDataString) {
            return Base64.decodeBase64(imageDataString);
        }
    
        private static String readHTMLData(String inputFile) {
            String data = "";
            String str = "";
    
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(new File(inputFile)), StandardCharsets.UTF_8))) {
                while ((str = reader.readLine()) != null) {
                    data += str;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
            return data;
        }
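The image counter in File1.java depends on reading the highest N from file names shaped like Image_N.ext. As a small self-contained illustration of that scan (the class name ImageCounter and the method highestIndex are hypothetical, not part of the listing), the numbers are compared as integers, because lexicographic ordering would place Image_10 before Image_2:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: find the highest counter among names such as Image_7.png. */
final class ImageCounter {
    // Up to nine digits, so Integer.parseInt cannot overflow
    private static final Pattern IMAGE_NAME = Pattern.compile("^Image_(\\d{1,9})\\.[^.]+$");

    /** Returns the highest N found, or 0 when no name matches. */
    static int highestIndex(List<String> fileNames) {
        int max = 0;
        for (String name : fileNames) {
            Matcher m = IMAGE_NAME.matcher(name);
            if (m.matches()) {
                max = Math.max(max, Integer.parseInt(m.group(1)));
            }
        }
        return max;
    }
}
```

The next image would then be named with highestIndex(...) + 1. Note that scanning the directory on every call is not safe when several requests write images for the same entity at the same time.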

File2.java

 import java.io.File;
 import java.io.IOException;
 import java.nio.file.Files;
 
 import javax.servlet.ServletException;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;
 import com.newgen.clos.logging.consoleLogger.Console;
 public class ImageServlet extends HttpServlet {
     public ImageServlet() {
         super();
     }
 
     @Override
     protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
         String param = request.getParameter("image");
         Console.log("Image Servlet executed");
         Console.log("File Name Requested: " + param);
         if (param == null) {
             response.sendError(HttpServletResponse.SC_BAD_REQUEST, "Missing image parameter");
             return;
         }
         // Strings are immutable, so the result of replace() has to be assigned back
         param = param.replace("\"", "");
         param = param.replace("%20"," ");
         File file = new File(param);
         if (!file.isFile()) {
             response.sendError(HttpServletResponse.SC_NOT_FOUND);
             return;
         }
         // getMimeType() returns null for unknown types, so fall back to a generic one
         String mimeType = getServletContext().getMimeType(param);
         response.setHeader("Content-Type", mimeType != null ? mimeType : "application/octet-stream");
         response.setHeader("Content-Length", String.valueOf(file.length()));
         // Expose only the file name, not the full disk path
         response.setHeader("Content-Disposition", "inline; filename=\"" + file.getName() + "\"");
         Files.copy(file.toPath(), response.getOutputStream());
     }
 }
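Because the servlet opens whatever path arrives in the image parameter, it may be worth restricting requests to the directory where the images are stored. The following is a minimal sketch of such a check using only java.nio.file; the class name ImagePaths, the method resolveInside and the idea of a single image root directory are assumptions made for this example and are not part of the answer's code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: accept a requested path only if it stays inside the image root. */
final class ImagePaths {
    /**
     * Resolves the requested path against root and rejects anything that ends up
     * outside it, for example through ".." segments or an unrelated absolute path.
     */
    static Path resolveInside(Path root, String requested) {
        Path base = root.toAbsolutePath().normalize();
        // resolve() returns the requested path itself when it is absolute
        Path candidate = base.resolve(requested).normalize();
        if (!candidate.startsWith(base)) {
            throw new IllegalArgumentException("Path escapes image root: " + requested);
        }
        return candidate;
    }
}
```

In doGet, the result of resolveInside(Paths.get(imageRootDirectory), param) could be used in place of new File(param), answering with a 404 when the exception is thrown; imageRootDirectory here stands for whatever directory the images are written to. The check is purely lexical and does not follow symbolic links.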
 
