Simplest way to correctly load html from web page into a string in Java


Question




Just what the title says.

Help greatly appreciated!

Answer

An extremely common error is failing to convert an HTTP response from bytes to characters correctly. To do this, you have to know the character encoding of the response. Ideally, it is specified as the charset parameter of the "Content-Type" header. But declaring it in the body itself, as an "http-equiv" attribute in a meta tag, is also an option.

So, it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.

Here's a simple implementation that will handle the most common case:

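import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Accept "text/html; charset=utf-8" with optional whitespace and optional quotes.
Pattern p = Pattern.compile("text/html;\\s*charset=\"?([^\\s;\"]+)\"?\\s*", Pattern.CASE_INSENSITIVE);
String contentType = con.getContentType(); // null when the header is absent
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
StringBuilder buf = new StringBuilder();
// try-with-resources closes the stream even if reading fails.
try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
  char[] chunk = new char[4096];
  int n;
  while ((n = r.read(chunk)) >= 0) {
    buf.append(chunk, 0, n);
  }
}
String str = buf.toString();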
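
The code above only looks at the Content-Type header. For the other case mentioned, where the encoding is declared in a meta tag, one possible fallback is to look for a charset declaration in the first bytes of the body. The following is only a rough sketch, not a general solution: the class name MetaCharsetSniffer, the method name sniff, and the 1024-byte window are assumptions made for illustration, and the regex is a heuristic rather than a real HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class MetaCharsetSniffer {
    // Heuristic: finds charset=... inside a <meta ...> tag. Covers both
    // <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    public static String sniff(byte[] body) {
        // Assumption: the declaration sits within the first 1024 bytes.
        int len = Math.min(body.length, 1024);
        // ISO-8859-1 maps every byte to exactly one char, so this decode
        // cannot fail and leaves the ASCII markup readable.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=UTF-8\"></head></html>")
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page)); // prints UTF-8
    }
}
```

To use this, the response would have to be read into a byte array first, with the header charset taking priority and sniff(bytes) consulted only when the header gives none, before falling back to "ISO-8859-1" as above.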
