How can I detect the charset of a web page?
Question
I want to fetch the source of a web page in Java and decode it with the correct character encoding. I can already retrieve the content, but for some pages it comes back as garbled characters, so I need to detect the charset of the page.
From a bit of research I found that the jChardet library can do this, but I couldn't import it into my project. Can someone please help me?
For reference, this is the code I use to read the page content:
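    // fURL is the java.net.URL of the page; encodingType is the charset name used for decoding.
    StringBuilder builder = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream) even if reading fails.
    try (BufferedReader buffer = new BufferedReader(
            new InputStreamReader(fURL.openStream(), encodingType))) {
        int charRead;
        // read() returns one decoded character at a time, or -1 at end of stream.
        while ((charRead = buffer.read()) != -1) {
            builder.append((char) charRead);
        }
    }
    return builder;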
Recommended answer
Read the Content-Type header of the HTTP response; it's the best way to get the charset. Only fall back to guessing when you have no alternative, and here you do have one.
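As a rough illustration of the header-based approach, the sketch below pulls the charset parameter out of a Content-Type value and uses it to decode the response. The class name CharsetFromHeader and the helpers charsetOf and readPage are hypothetical names invented for this example, and falling back to UTF-8 when the header names no usable charset is an assumption, not something the answer prescribes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class CharsetFromHeader {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Reads the whole page, decoding with the charset announced by the server.
    // The UTF-8 fallback is an assumption made for this sketch.
    static String readPage(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        Charset cs = charsetOf(conn.getContentType(), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), cs))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

With this in place, the reading code from the question could call CharsetFromHeader.readPage(fURL) instead of passing a fixed encodingType. If the server sends no charset at all, a detector such as jChardet would only be needed as a last resort.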