How can I build an HTML org.w3c.dom.Document?


Question


The documentation of the Document interface describes the interface as:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I am unable to find a way to build a Document that is an HTML Document!

I want an HTML Document because I am building a document that I then pass to a library that expects an HTML Document. This library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML, but not for XML.

I've looked around and am not finding anything. Questions like "How to convert an Html source of a webpage into org.w3c.dom.Document in java?" don't actually have an answer.

Answer

You seem to have two explicit requirements:

  1. You need to represent HTML as a org.w3c.dom.Document.
  2. You need Document#getElementsByTagName(String tagname) to operate in a case-insensitive manner.

If you are trying to work with HTML using org.w3c.dom.Document, then I assume you are working with some flavor of XHTML, because an XML API such as DOM is going to expect well-formed XML. HTML isn't necessarily well-formed XML, but XHTML is. Even if you were working with HTML, you would have to do some pre-processing to ensure it is well-formed XML before trying to run it through an XML parser. It might be easier to parse the HTML first with an HTML parser, such as jsoup, and then build your org.w3c.dom.Document by walking the tree the HTML parser produces (org.jsoup.nodes.Document in the case of jsoup).


There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-j (2.11.0) in the form of org.apache.html.dom.HTMLDocumentImpl. At first this seems promising, however upon closer examination, we find that there are some issues.

1. There is not a clear, "clean" way to obtain an instance of an object implementing the org.w3c.dom.html.HTMLDocument interface.

With Xerces we normally would obtain a Document object using a DocumentBuilder in the following fashion:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using a DOMImplementation variety:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases, we use only standard APIs (javax.xml.parsers and org.w3c.dom.*) to obtain the Document object.
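To make the underlying problem concrete, here is a minimal sketch showing that a Document obtained through the standard DocumentBuilder matches tag names case-sensitively. The class name CaseSensitiveLookupDemo and the helper buildDoc are made up for illustration; only JDK classes are used.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original answer.
public class CaseSensitiveLookupDemo {

    /** Builds a tiny document with lowercase element names via the standard JAXP API. */
    public static Document buildDoc() throws ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element html = doc.createElement("html");
        doc.appendChild(html);
        html.appendChild(doc.createElement("body"));
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = buildDoc();
        // Exact-case lookup finds the element: prints 1
        System.out.println(doc.getElementsByTagName("body").getLength());
        // A different case finds nothing, because XML names are case-sensitive: prints 0
        System.out.println(doc.getElementsByTagName("BODY").getLength());
    }
}
```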

The closest equivalent I found for HTMLDocument was something like this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

This requires us to use internal implementation classes directly, which makes our code dependent on the Xerces implementation.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't look into it.)

2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

Although you can search the HTMLDocument tree using getElementsByTagName(String tagname) in a case-insensitive manner, all of the element names are stored internally in ALL CAPS, whereas XHTML element and attribute names are supposed to be all lowercase. (This could be worked around by walking the entire document tree and using Document's renameNode() method to change each element's name to lowercase.)
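The renameNode() workaround could look roughly like the sketch below. It is shown on a plain Document built with the JDK's DocumentBuilder, not on a Xerces HTMLDocument, so whether HTMLDocumentImpl accepts the rename is an assumption left untested here. The class and method names are hypothetical. Note that lowercasing the qualified name also lowercases any namespace prefix.

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of the original answer.
public class LowercaseElementNames {

    /** Recursively renames every element at or below node to its lowercase name. */
    public static void lowercaseElementNames(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            String lower = name.toLowerCase(Locale.ROOT);
            if (!name.equals(lower)) {
                // renameNode may return a replacement node, so keep its return value.
                node = doc.renameNode(node, node.getNamespaceURI(), lower);
            }
        }
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first, in case the child is replaced by the rename.
            Node next = child.getNextSibling();
            lowercaseElementNames(doc, child);
            child = next;
        }
    }

    /** Builds an ALL CAPS sample tree and lowercases it. */
    public static Document demo() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("HTML");
        doc.appendChild(html);
        Element body = doc.createElement("BODY");
        html.appendChild(body);
        body.appendChild(doc.createElement("P"));
        lowercaseElementNames(doc, doc.getDocumentElement());
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = demo();
        System.out.println(doc.getDocumentElement().getTagName());     // prints html
        System.out.println(doc.getElementsByTagName("p").getLength()); // prints 1
    }
}
```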

Additionally, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set those in an HTMLDocument (unless you do some fiddling with internal Xerces implementations).
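For comparison, a plain org.w3c.dom.Document can be given both a DOCTYPE and a namespaced root element through the standard DOMImplementation interface, without touching Xerces internals. The following is a minimal sketch; the class name XhtmlSkeleton and the method newXhtmlDocument are invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper class, not part of the original answer.
public class XhtmlSkeleton {

    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Creates an empty XHTML 1.0 Strict document: DOCTYPE plus a namespaced html root. */
    public static Document newXhtmlDocument() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();
        DocumentType docType = impl.createDocumentType(
                "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        // createDocument attaches the doctype and creates the root element in one step.
        return impl.createDocument(XHTML_NS, "html", docType);
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = newXhtmlDocument();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```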

3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface seems incomplete.

I didn't scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs, and comments in the source code of the Xerces internal implementation. In those comments, I also found notes that several different parts of the interface weren't implemented. (Sidenote: I really got the impression that the org.w3c.dom.html.HTMLDocument interface itself isn't really used by anyone and perhaps is incomplete itself.)


For those reasons, I think it's better to avoid org.w3c.dom.html.HTMLDocument and just do what we can with org.w3c.dom.Document. What can we do?

One approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach doesn't require much code, but it still ties us to Xerces, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we simply convert all tag names to lowercase on element creation and searches. This allows Document#getElementsByTagName(String tagname) to be used in a case-insensitive manner.

MyHTMLDocumentImpl:

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with the basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

    /**
     * This method allows us to create our
     * MyHTMLDocumentImpl from an existing Document.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase());
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase());
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase());
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase());
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase());
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    public static void main(String[] args) //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc)); 
    }

}

Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>


Another approach similar to the above is to instead make a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extending DocumentImpl" approach, but this way is "cleaner" as we don't have to care about particular Document implementations. The extra code for this approach isn't difficult; it's just a bit tedious to provide all those wrapper implementations for the Document methods. I haven't completely worked this out yet and there may be some problems, but if it works, this is the general idea:

public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //...
}
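If writing out every delegating method by hand is too tedious, one alternative (swapped in here, not what the answer describes) is a java.lang.reflect.Proxy that forwards all Document calls to a wrapped instance and lowercases the name argument of createElement and getElementsByTagName. This is only a sketch under assumptions: the class name CaseInsensitiveDocumentProxy is invented, the NS variants and renameNode are not handled, and a proxy is not a node of the underlying implementation, so code that casts the Document to an implementation class will not accept it.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical wrapper factory, not part of the original answer.
public class CaseInsensitiveDocumentProxy {

    // Document methods whose single String argument is a tag name.
    private static final Set<String> TAG_NAME_METHODS =
            Set.of("createElement", "getElementsByTagName");

    /** Returns a Document view of delegate that lowercases tag names on creation and lookup. */
    public static Document wrap(Document delegate) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (args != null && args.length == 1 && args[0] instanceof String
                    && TAG_NAME_METHODS.contains(method.getName())) {
                args[0] = ((String) args[0]).toLowerCase(Locale.ROOT);
            }
            try {
                // Everything else is forwarded unchanged to the wrapped document.
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                // Rethrow the DOMException (or other cause) the delegate raised.
                throw e.getCause();
            }
        };
        return (Document) Proxy.newProxyInstance(
                CaseInsensitiveDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class },
                handler);
    }

    public static void main(String[] args) throws Exception {
        Document plain = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Document doc = wrap(plain);
        Element root = doc.createElement("HTML");
        doc.appendChild(root);
        System.out.println(root.getTagName());                            // prints html
        System.out.println(doc.getElementsByTagName("HtMl").getLength()); // prints 1
    }
}
```

Elements returned by the proxy still report the wrapped document from getOwnerDocument(), so lookups made through that reference remain case-sensitive.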


Whether it's org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will help give you an idea of how to proceed.


Edit:

In my parsing tests, while trying to parse the following XHTML file, Xerces would hang in an entity management class trying to open an HTTP connection. I don't know why, especially since I tested on a local HTML file with no entities. (Maybe it has something to do with the DOCTYPE or namespace?) This is the document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>
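A likely explanation, offered as an assumption and not verified against the setup described above: even a non-validating parser normally fetches the external DTD named in the DOCTYPE, which here means an HTTP request to www.w3.org. One common way to avoid the network access is to install an EntityResolver that answers every external entity with an empty source. The sketch below uses only JDK classes; the class name OfflineXhtmlParse is invented. With an empty DTD, named entities such as &nbsp; would no longer be defined, so this suits documents like the one above that use none.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original answer.
public class OfflineXhtmlParse {

    public static final String XHTML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
            + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<head><title>My Title</title></head>\n"
            + "<body>\n"
            + "<p id=\"anId1\">Here is some text1.</p>\n"
            + "<p id=\"anId2\">Here is some text2.</p>\n"
            + "<p id=\"anId3\">Here is some text3.</p>\n"
            + "</body>\n"
            + "</html>\n";

    /** Parses XHTML text without fetching the DTD referenced by its DOCTYPE. */
    public static Document parseOffline(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Answer every external entity request (including the DTD) with empty content.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseOffline(XHTML);
        System.out.println(doc.getElementsByTagName("p").getLength()); // prints 3
    }
}
```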
