Simplest way to correctly load HTML from a web page into a string in Java
Problem description
Just what the title says.
Help greatly appreciated!
An extremely common error is failing to convert an HTTP response from bytes to characters correctly. To do this, you have to know the character encoding of the response. Ideally it is specified as the charset parameter of the "Content-Type" header, but it can also be declared in the body itself, as an "http-equiv" attribute in a meta tag.
So it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.
Here's a simple implementation that will handle the most common case:
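import java.io.*;
import java.net.*;
import java.util.regex.*;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
// Allow optional whitespace after the semicolon, optional quotes, and any letter case.
Pattern p = Pattern.compile("text/html;\\s*charset=\"?([^\\s;\"]+)\"?\\s*", Pattern.CASE_INSENSITIVE);
String contentType = con.getContentType(); // null when the header is missing
/* If Content-Type doesn't match this pre-conception, choose default and
 * hope for the best. */
String charset = "ISO-8859-1";
if (contentType != null) {
    Matcher m = p.matcher(contentType);
    if (m.matches())
        charset = m.group(1);
}
StringBuilder buf = new StringBuilder();
// try-with-resources closes the reader and the underlying stream.
try (Reader r = new BufferedReader(new InputStreamReader(con.getInputStream(), charset))) {
    char[] chunk = new char[8192];
    int n;
    // Read in chunks rather than one character at a time.
    while ((n = r.read(chunk)) >= 0)
        buf.append(chunk, 0, n);
}
String str = buf.toString();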
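The answer mentions that the encoding may also be declared in a meta tag in the body, which the implementation above does not handle. The following is a rough sketch of one way to add that fallback, not a complete encoding-sniffing algorithm. The class name CharsetSniffer, its method names, and the 4096-byte scan limit are all assumptions made for illustration. It expects the response body to have been read into a byte array first, so the bytes can be inspected before they are decoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: picks a charset from the header, then from a meta tag.
class CharsetSniffer {
    // charset parameter in a Content-Type value, e.g. "text/html; charset=UTF-8"
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // charset inside a meta tag; covers <meta charset="..."> and the
    // http-equiv form with content="text/html; charset=..."
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    static Charset detect(String contentType, byte[] body) {
        if (contentType != null) {
            Charset fromHeader = lookup(HEADER_CHARSET.matcher(contentType));
            if (fromHeader != null) return fromHeader;
        }
        // Decode only the leading bytes as ISO-8859-1: every byte maps to a
        // character, so the ASCII markup of the meta tag stays readable.
        int len = Math.min(body.length, 4096); // scan limit is an assumption
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Charset fromMeta = lookup(META_CHARSET.matcher(head));
        // Same default as the answer above when nothing is declared.
        return fromMeta != null ? fromMeta : StandardCharsets.ISO_8859_1;
    }

    private static Charset lookup(Matcher m) {
        if (!m.find()) return null;
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as not declared.
            return null;
        }
    }

    static String decode(String contentType, byte[] body) {
        return new String(body, detect(contentType, body));
    }
}
```

With this sketch, the bytes from con.getInputStream() would be collected into a byte array and passed to CharsetSniffer.decode(con.getContentType(), bytes). The header takes precedence over the meta tag here; pages without either declaration still fall back to ISO-8859-1.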