Java UTF-8 encoding not set to URLConnection

Views: 680 | Published: 2018/12/12 18:39:57 | Tags: java unicode utf8-decode


Question

I am trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47

As you can see, the text contains characters like these: /ælˈdʒɪəriə/.

When I try to get the source of the page, I get text containing sequences like &#250; instead of the characters themselves.

So far I've tried the following code:
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");

What am I doing wrong?

My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) {e.printStackTrace();}

try {
    urlConn = url.openConnection(); 
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

urlConn.setDoInput(true);
urlConn.setUseCaches(false);

StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);

try {
    input = new DataInputStream(urlConn.getInputStream()); 
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine()))) 
    {
        strB.append(str); 
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }

 
 
Recommended answer
The HTML page is in UTF-8 and could contain Arabic characters and the like directly. However, the characters above code point 127 are still encoded in the page as numeric entities such as &#250;. Request headers such as Accept-Charset will not help with that, and reading the response as UTF-8 is entirely right.
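To illustrate the point above, here is a minimal sketch showing that a numeric entity consists only of ASCII characters, so it looks the same whichever charset is used to decode the bytes. The class name `EntityBytesDemo` and the method `decodeAs` are hypothetical names chosen for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EntityBytesDemo {
    // Decodes raw bytes into a String using the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The six characters & # 2 5 0 ; are all plain ASCII.
        byte[] body = "&#250;".getBytes(StandardCharsets.US_ASCII);

        // Both decodings give back the entity text unchanged,
        // so the choice of charset cannot turn it into a real character.
        System.out.println(decodeAs(body, StandardCharsets.UTF_8));
        System.out.println(decodeAs(body, StandardCharsets.ISO_8859_1));
    }
}
```

This is why changing the headers or the reader's charset leaves the entities in place: they have to be replaced in a separate step after the text has been read.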
You have to decode the entities yourself. Something like:
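import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replaces decimal numeric entities such as &#250; with the characters they denote.
String decodeNumericEntities(String s) {
    StringBuffer sb = new StringBuffer();
    Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, "");   // copy the text before the entity, drop the entity itself
        sb.appendCodePoint(uc);        // append the decoded character
    }
    m.appendTail(sb);
    return sb.toString();
}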
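As a self-contained variant of the idea above, the following sketch also accepts hexadecimal entities (such as &#xFA;) and leaves anything that is not a valid code point untouched. Hexadecimal support, the digit limits in the pattern, and the names `EntityDecoder` and `decode` are assumptions added for illustration; they are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {
    // Group 1: hex digits (&#xFA;), group 2: decimal digits (&#250;).
    // The digit limits keep Integer.parseInt from overflowing.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String s) {
        Matcher m = NUMERIC_ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Keep the original entity text if the number is not a code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement guards against '$' and '\' in the replacement.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Entity-encoded form of the IPA string from the question.
        System.out.println(decode("/&#230;l&#712;d&#658;&#618;&#601;ri&#601;/"));
    }
}
```

Named entities such as &amp;amp; are deliberately not handled here. If they occur in the data, an HTML-aware library is probably a better fit than a regular expression.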

By the way, those entities could stem from processed HTML forms, that is, from the editing side of the web app.

After seeing the code in the question:

I have replaced the DataInputStream with a (Buffered)Reader, which is meant for text. InputStreams read binary data (bytes); Readers read text (characters and Strings). An InputStreamReader takes an InputStream and an encoding as parameters and acts as a Reader over the decoded text.
try {
    BufferedReader input = new BufferedReader(
            new InputStreamReader(urlConn.getInputStream(), "UTF-8")); 
    StringBuilder strB = new StringBuilder();
    String str;
    while (null != (str = input.readLine())) {
        strB.append(str).append("\r\n"); 
    }
    input.close();
} catch (IOException e) {
    e.printStackTrace();
}
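The same reading loop can be tried without a network connection. The sketch below feeds UTF-8 bytes from a ByteArrayInputStream, which stands in for `urlConn.getInputStream()`, and uses try-with-resources so the reader is closed even if reading fails. The names `ReaderDemo` and `readAll` are hypothetical, and wrapping the IOException in an UncheckedIOException is a choice made here only to keep the example short.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Reads the whole stream as UTF-8 text, ending each line with '\n'.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the HTTP response body: the IPA string as UTF-8 bytes.
        byte[] body = "/\u00e6l\u02c8d\u0292\u026a\u0259ri\u0259/\n"
                .getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(body));
        // Multi-byte UTF-8 sequences are decoded into single characters.
        System.out.println(text.length() + " chars from " + body.length + " bytes");
    }
}
```

With a real connection, passing `urlConn.getInputStream()` to `readAll` should behave the same way, provided the server really sends UTF-8.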

