Java - convert named html entities to numbered xml entities

Problem description

I'm looking to convert an HTML block that contains named HTML entities into an XML-compliant block that uses numbered XML entities, while leaving all HTML tag elements in place.

This is the basic idea illustrated via test:

@Test
public void testEvalHtmlEntitiesToXmlEntities() {
    String input = "<a href=\"test.html\">link&nbsp;</a>";
    String expected = "<a href=\"test.html\">link&#160;</a>";
    String actual = SomeUtil.eval(input);
    Assert.assertEquals(expected, actual);
}

Is anyone aware of a class that provides this functionality? I could write a regex that iterates over the non-element matches and does:

xmlString += StringEscapeUtils.escapeXml(StringEscapeUtils.unescapeHtml(htmlString));

but I was hoping there is an easier way, or a class that already provides this.
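For comparison, here is a minimal sketch of the regex idea using only the JDK. It matches named entity references directly instead of splitting the text around tags, so the markup is left alone. The class name `EntityConverter`, the method `toNumeric`, and the tiny entity table are hypothetical and only for illustration; a real table would need the full HTML entity list. The five names XML predefines (`amp`, `lt`, `gt`, `quot`, `apos`) are deliberately not in the table, so they pass through unchanged, as do unknown names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityConverter {
    // Matches a named entity reference such as &nbsp; (not &#160; or &#xA0;).
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Illustrative subset only (assumption): extend with the full HTML entity list.
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("nbsp", 160);
        CODE_POINTS.put("copy", 169);
        CODE_POINTS.put("reg", 174);
        CODE_POINTS.put("eacute", 233);
        CODE_POINTS.put("mdash", 8212);
        CODE_POINTS.put("euro", 8364);
    }

    static String toNumeric(String html) {
        Matcher m = NAMED_ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Names missing from the table (including amp, lt, gt, quot, apos) are kept as-is.
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <a href="test.html">link&#160;</a>
        System.out.println(toNumeric("<a href=\"test.html\">link&nbsp;</a>"));
    }
}
```

This sketch does not parse the HTML, so it would also rewrite entity-like text inside `<script>` blocks or comments; that may or may not matter for your input.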

Solution

Have you tried JTidy?

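import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

// Runs the HTML fragment through JTidy and returns the cleaned body as XML.
private String cleanData(String data) throws UnsupportedEncodingException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setPrintBodyOnly(true); // only print the content
    tidy.setXmlOut(true); // to XML
    tidy.setSmartIndent(true);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // parseDOM writes the tidied markup to outputStream; the returned Document is not needed here
    tidy.parseDOM(inputStream, outputStream);
    return outputStream.toString("UTF-8");
}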

Note, though, that I think it will also repair parts of your HTML if it finds anything wrong with them.
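Whichever approach produces the output, one way to check that the result really is XML-compliant is to run it through the JDK's own XML parser. The sketch below is an illustration, not part of JTidy: the class name `XmlCheck`, the method `isWellFormed`, and the wrapping `<root>` element are assumptions made so that a fragment can be parsed as a document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class XmlCheck {
    // Returns true if the fragment parses as XML once wrapped in a single root element.
    static boolean isWellFormed(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")));
            return true;
        } catch (SAXException e) {
            // Raised for undeclared entities such as &nbsp; and for malformed markup.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a href=\"test.html\">link&nbsp;</a>"));
        System.out.println(isWellFormed("<a href=\"test.html\">link&#160;</a>"));
    }
}
```

The first call should report the fragment as not well-formed, because `nbsp` is not declared in plain XML, while the numeric form in the second call parses.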
