Java UTF-8 encoding not set to URLConnection
Question
I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47
As you can see, the text contains characters like these: /ælˈdʒɪəriə/.
When I try to get the source of the page, I get text with sequences like &#250; etc.
So far I've tried the following code:
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");
What am I doing wrong?
My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
    url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) { e.printStackTrace(); }
try {
    urlConn = url.openConnection();
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
urlConn.setDoInput(true);
urlConn.setUseCaches(false);
StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);
try {
    input = new DataInputStream(urlConn.getInputStream());
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine())))
    {
        strB.append(str);
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }
Answer
The HTML page is served as UTF-8 and could contain Arabic characters and the like directly. However, the characters above code point 127 are still encoded in it as numeric entities such as &#250;. Setting request headers such as Accept-Charset will not help, and reading the response as UTF-8 is entirely correct.
You have to decode the entities yourself. Something like:
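import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replaces decimal numeric entities such as &#250; with the characters they denote.
String decodeNumericEntities(String s) {
    // StringBuffer, because Matcher.appendReplacement accepts StringBuilder only from Java 9 on.
    StringBuffer sb = new StringBuffer();
    Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, ""); // copy the text before the entity, drop the entity itself
        sb.appendCodePoint(uc);      // append the decoded character
    }
    m.appendTail(sb);
    return sb.toString();
}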
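The helper above handles decimal entities only. As a rough sketch, assuming the page might also contain hexadecimal forms such as &#xFA; (the question does not confirm this), a variant could look like the following. The class name NumericEntities and the method name decode are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class NumericEntities {
    // Matches decimal (&#250;) and hexadecimal (&#xFA;) numeric character references.
    private static final Pattern ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    static String decode(String s) {
        StringBuffer sb = new StringBuffer();
        Matcher m = ENTITY.matcher(s);
        while (m.find()) {
            String replacement;
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
                // Leave the entity untouched if it is not a valid code point.
                replacement = Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp))
                        : m.group();
            } catch (NumberFormatException e) {
                replacement = m.group(); // number too large: keep the entity text
            }
            // quoteReplacement guards against decoded '$' or '\' characters.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The pronunciation string from the question, written with entities.
        System.out.println(decode("/&#230;l&#712;d&#658;&#618;&#601;ri&#601;/"));
    }
}
```

Named entities such as &amp; are deliberately left alone here; for those, an HTML-aware library is probably the better choice.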
By the way, those entities could stem from processed HTML forms, that is, from the editing side of the web app.
After seeing the code in the question:
I have replaced the DataInputStream with a (Buffered)Reader for text. InputStreams read binary data (bytes); Readers read text (characters, Strings). An InputStreamReader takes an InputStream and an encoding as parameters and is itself a Reader.
// Needs java.io.BufferedReader, java.io.InputStreamReader, java.io.IOException
// and java.nio.charset.StandardCharsets.
StringBuilder strB = new StringBuilder();
try (BufferedReader input = new BufferedReader(
        new InputStreamReader(urlConn.getInputStream(), StandardCharsets.UTF_8))) {
    String str;
    while (null != (str = input.readLine())) {
        strB.append(str).append("\r\n");
    }
} catch (IOException e) {
    e.printStackTrace();
}
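To illustrate the difference between bytes and text described above, here is a small sketch that reads the same UTF-8 bytes both ways from an in-memory array instead of a network connection. The class and method names are invented for this example. DataInputStream.readLine is deprecated precisely because it turns each byte into one char.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original answer.
class ReaderVsStreamDemo {
    // Reads the first line the way the question's code does: one byte becomes one char.
    @SuppressWarnings("deprecation")
    static String readLineAsBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the first line through a Reader that decodes the bytes as UTF-8.
    static String readLineAsUtf8(byte[] data) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+00FA is encoded in UTF-8 as the two bytes 0xC3 0xBA.
        byte[] data = "\u00fa\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLineAsBytes(data).length()); // prints 2
        System.out.println(readLineAsUtf8(data).length());  // prints 1
    }
}
```

In the question's case the payload consists of ASCII entities, so this mismatch would not show up there, but it would as soon as the server sent raw non-ASCII characters.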