将扩展转义模式用于jsoup输出时出现问题 [英] Problems using extended escape mode for jsoup output

查看:95
本文介绍了将扩展转义模式用于jsoup输出时出现问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要通过从文件中删除某些标签来转换HTML文件.为此,我有类似的内容-

I need to transform a HTML file, by removing certain tags from the file. To do this I have something like this -

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.nodes.Entities.EscapeMode;

import java.io.IOException;
import java.io.File;
import java.util.*;

public class TestJsoup {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];

        Document doc = null;
        if(url.contains("http")) {
           doc = Jsoup.connect(url).get();
        } else {
           File f = new File(url);
           doc = Jsoup.parse(f, null);
        }

        /* remove some tags */

        doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
        System.out.println(doc.html());

        return;
    }
}

上述代码的问题是,当我使用扩展转义模式时,输出的html标记属性为html编码.反正有避免这种情况的方法吗?使用转义模式作为基本或xhtml无效,因为某些非标准扩展(如’)编码会带来问题.对于以下HTML的ex,

The issue with the above code is that, when I use extended escape mode, the output has the html tag attributes being html encoded. Is there anyway to avoid this? Using escape mode as base or xhtml doesn't work as some of the non standard extended (like ’) encoding give problems. For ex for the HTML below,

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test&reg;</title>
</head>
<body style="background-color:#EDEDED;">
<P>
   <font style="color:#003698; font-weight:bold;">Testing HTML encoding - &rsquo; &copy; with a <a href="http://www.google.com">link</a>
   </font> 
   <br />
</P>
</body>
</html>

我得到的输出是

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
 <head>&NewLine;
  <title>Test&reg;</title>&NewLine;
 </head>&NewLine;
 <body style="background-color&colon;&num;EDEDED&semi;">&NewLine;
  <p>&NewLine; <font style="color&colon;&num;003698&semi; font-weight&colon;bold&semi;">Testing HTML encoding - &rsquor; &copy; with a <a href="http&colon;&sol;&sol;www&period;g
oogle&period;com">link</a></font> <br />&NewLine;</p>&NewLine;&NewLine;&NewLine;&NewLine;
 </body>
</html>

总有办法解决这个问题吗?

Is there anyway to get around this issue?

推荐答案

What output encoding character set are you using? (It will default to the input, which if you are loading from URLs, will vary according to the site).

如果您正在使用无法处理UTF-8的系统,则可能需要将其显式设置为UTF-8ASCII或其他一些较低的设置.如果将转义模式设置为base(默认值),并且将字符集设置为ascii,则将无法在所选字符集中以本机表示的任何字符(如rsquo)输出为数字转义.

You probably want to explicitly set it to either UTF-8, or ASCII or some other low setting if you are working with systems that cannot deal with UTF-8. If you set the escape mode to base (the default), and the charset to ascii, then any character (like rsquo) than cannot be represented natively in the selected charset will be output as a numerical escape.

例如:

String check = "<p>&rsquo; <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default

doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());

doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());

赠予:

UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>&#8217; <a href="../">Check</a></p>

希望这会有所帮助!

这篇关于将扩展转义模式用于jsoup输出时出现问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆