Why does this BufferedReader not read in the specified UTF-8 Format?


Problem description


I am scraping a few websites and some of them contain non-Latin characters and special characters like “ for quotes rather than " and ’ for apostrophes rather than '.

Here's the real curve ball...

I have the relevant text printed out to the console. Everything displays fine when I run it in my IDE (Netbeans). But when I run it on my computer, “I Need Your Help” is printed out as: ΓÇ£I Need Your HelpΓÇ¥...

Before anyone says I need to set my JAVA_TOOL_OPTIONS Environment Variable to -Dfile.encoding=UTF8 let me say that I have already done that and this is still a problem. Besides, shouldn't my specifying the encoding for the buffered reader to be "UTF-8" override that anyway?

Here's some info:

  • I'm using JDK 7 with the target platform set to 1.7
  • Every machine I run this on is a Windows 7 machine, and they all show the same problem (some don't have JAVA_TOOL_OPTIONS set, but that doesn't seem to make any difference).
  • I think the default encoding that it's using is Cp1252...

Here's my code. Let me know whether you need more info. Thanks!

/**
 * Using the given url, this method creates and returns the buffered reader for that url
 *
 * @param urlString
 * @return
 * @throws MalformedURLException
 * @throws IOException
 */
public synchronized static BufferedReader getBufferedReader(String urlString) throws MalformedURLException, IOException {
  URL url = new URL(urlString);
  InputStream is = url.openStream();
  BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
  return br;
}

Solution

There are two possibilities here. As user1291492 said, it could be that you read the content correctly but the encoding that your terminal uses is different from the one your IDE uses.
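
To illustrate the first possibility, here is a minimal sketch that reproduces the garbled output from the question without touching the network. It assumes the console is decoding bytes with code page 437 (an inference from the ΓÇ£ / ΓÇ¥ pattern, not something the question states) and that the JRE ships the IBM437 charset, which is an extended charset rather than one of the guaranteed standard ones. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the code in the question.
final class ConsoleEncodingDemo {

    /**
     * Encodes the text as UTF-8 and then decodes those bytes with another
     * charset, which is what a console with a mismatched code page does.
     */
    static String misdecode(String text, Charset consoleCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, consoleCharset);
    }

    public static void main(String[] args) {
        // Curly quotes written as Unicode escapes so the source file's own
        // encoding cannot interfere with the demonstration.
        String text = "\u201CI Need Your Help\u201D";

        // Assumption: the console uses code page 437.
        Charset cp437 = Charset.forName("IBM437");
        String garbled = misdecode(text, cp437);

        // U+0393 U+00C7 U+00A3 is the sequence shown as the opening quote in
        // the question; U+0393 U+00C7 U+00A5 is the closing one.
        String expected = "\u0393\u00C7\u00A3I Need Your Help\u0393\u00C7\u00A5";
        System.out.println(garbled.equals(expected)); // prints "true"
    }
}
```

If this is what is happening, the string inside the program is already correct and the reader is doing its job; only the bytes written to the console are being interpreted with the wrong code page.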

The other possibility is that the source data is not in UTF-8. If you're scraping a website, then you should pay attention to what the website tells you it's using for encoding via the charset parameter of the Content-Type header, not assume that it's always UTF-8.
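
For the second possibility, a rough sketch of a reader that honours the Content-Type header could look like the following. It is a variation on the method from the question, not a drop-in replacement: the class name and the charsetFromContentType helper are hypothetical, the parsing is deliberately simple, and falling back to UTF-8 when the header carries no usable charset is an assumption of this sketch. It also ignores any encoding declared inside the page itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class CharsetAwareReader {

    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=ISO-8859-1". Returns the fallback when the header
     * is missing, has no charset parameter, or names an unknown charset.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            // Case-insensitive check for the "charset=" prefix.
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Opens the URL and decodes the body with the charset the server declares. */
    static BufferedReader getBufferedReader(String urlString) throws IOException {
        URLConnection connection = new URL(urlString).openConnection();
        Charset charset = charsetFromContentType(
                connection.getContentType(), StandardCharsets.UTF_8);
        return new BufferedReader(
                new InputStreamReader(connection.getInputStream(), charset));
    }

    public static void main(String[] args) {
        // Header parsing only; no network access needed.
        System.out.println(charsetFromContentType(
                "text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

Keeping the header parsing in its own method makes it possible to check the charset logic without a live connection.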
