How can I detect the charset of a web page?

Question

I want to fetch the source of a web page in Java and decode it with the correct encoding. I can already retrieve the content, but for some pages it comes back with garbled characters, so I need to detect the charset of the page.

From a little research I found that the jChardet library can do this, but I couldn't import it into my project. Can someone please help me?

By the way, this is the code I use to read the web page content:

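  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  // fURL (a java.net.URL) and encodingType (a charset name such as "UTF-8")
  // are defined elsewhere in the enclosing method.
  StringBuilder builder = new StringBuilder();
  // try-with-resources closes the reader and the underlying stream, even on error
  try (BufferedReader buffer = new BufferedReader(
      new InputStreamReader(fURL.openStream(), encodingType))) {
    int charRead; // read() returns one decoded character, or -1 at end of stream
    while ((charRead = buffer.read()) != -1) {
      builder.append((char) charRead);
    }
  }

  return builder;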

Recommended answer

Read the Content-Type header of the HTTP response; it is the best way to get the charset. Only resort to guessing when you have no alternative, and here you do have one.
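As a rough illustration of that advice, the sketch below reads the Content-Type header through `java.net.URLConnection` and pulls out its `charset` parameter. The class name `CharsetFromHeader`, the helper names, and the choice of UTF-8 as a fallback when the header carries no charset are assumptions made for this example, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
class CharsetFromHeader {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unsupported charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Downloads the page and decodes it with the charset the server declared.
    // Falling back to UTF-8 is an assumption of this sketch.
    static String readPage(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        Charset charset = charsetFromContentType(
                conn.getContentType(), StandardCharsets.UTF_8);
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                builder.append(chunk, 0, n);
            }
        }
        return builder.toString();
    }
}
```

If the server sends no charset in the header, the page may still declare one in a `<meta>` tag, and only after that would a guessing library be worth considering.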
