Encoding issues crawling non-English websites


Question

I'm trying to get the contents of a webpage as a string. I found this question on writing a basic web crawler, whose answer claims to (and seems to) handle the encoding issue. However, the code provided there works for US/English websites but fails to handle other languages properly.

Here is a full Java class that demonstrates the problem:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class I18NScraper
{
    static
    {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

    // https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and 
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(),charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
              int ch = r.read();
              if (ch < 0)
                break;
              buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");
    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}

Which outputs:

???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;???????

While it should be:

Результатов&nbsp;&mdash; Яндекс: Нашлось 360&nbsp;млн&nbsp;ответов

Can you help me understand what I'm doing wrong? Forcing UTF-8 does not help, even though that is the charset listed in both the page source and the HTTP header.

Recommended answer

The problem you are seeing is that the default encoding on your Mac doesn't support Cyrillic script. I'm not sure whether this is also true on an Oracle JVM, but when Apple was producing its own JVMs, the default character encoding for Java was MacRoman.

When you start your program, specify the file.encoding system property to set the character encoding to UTF-8 (which is what Mac OS X uses by default). Note that you have to set it when you launch: java -Dfile.encoding=UTF-8 ...; if you set it programmatically (with a call to System.setProperty()), it's too late, and the setting will be ignored.
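One way to confirm that the launch flag took effect is to print what the JVM reports at startup. This is a minimal sketch; the class name DefaultCharsetCheck and the report() helper are made up for illustration and are not part of the original code.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original I18NScraper.
public class DefaultCharsetCheck {
    // Builds a one-line report of the encoding settings the JVM is running with.
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run once without the flag and once with: java -Dfile.encoding=UTF-8 DefaultCharsetCheck
        System.out.println(report());
    }
}
```

If the second value is a single-byte encoding such as MacRoman, Cyrillic text written to standard output cannot be represented.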

Whenever Java needs to encode characters to bytes (for example, when it converts text to bytes to write to the standard output or error streams), it uses the default encoding unless you explicitly specify a different one. If the default encoding can't encode a particular character, a suitable replacement character is substituted.
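As an alternative to changing the default, the encoding can be specified explicitly at the point where text is written. The sketch below is one possible way to do that, not something the answer itself prescribes; the class name Utf8Output and the utf8Writer() helper are hypothetical, and it assumes the terminal itself interprets the bytes as UTF-8.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original I18NScraper.
public class Utf8Output {
    // Wraps a byte stream in a writer that always encodes text as UTF-8,
    // regardless of the JVM's default charset. Auto-flush applies to println.
    static PrintWriter utf8Writer(OutputStream target) {
        return new PrintWriter(new OutputStreamWriter(target, StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        PrintWriter out = utf8Writer(System.out);
        // Unicode escapes for the Cyrillic word "Яндекс", so the source file's
        // own encoding cannot affect the literal.
        out.println("\u042F\u043D\u0434\u0435\u043A\u0441");
    }
}
```

In the scraper, main() could print through such a writer instead of System.out.println.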

If the encoding can handle the Unicode replacement character, U+FFFD (�), that's used. Otherwise, a question mark (?) is a commonly used replacement character.
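The substitution can be reproduced without any network access. The following sketch (the class and method names are invented for illustration) encodes Cyrillic text with a charset that has no Cyrillic characters, which yields the same question marks seen in the question's output. It also shows the decoding direction, where Java substitutes U+FFFD for bytes that are not valid in the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original I18NScraper.
public class ReplacementDemo {
    // Encodes text to bytes with the given charset, then decodes those bytes
    // with the same charset. Characters the charset cannot encode are lost.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Decodes raw bytes as UTF-8; malformed input becomes U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String cyrillic = "\u042F\u043D\u0434\u0435\u043A\u0441"; // "Яндекс"

        // ISO-8859-1 has no Cyrillic letters, so each one is replaced by '?'.
        System.out.println(roundTrip(cyrillic, StandardCharsets.ISO_8859_1)); // ??????

        // UTF-8 can encode every character, so the text survives unchanged.
        System.out.println(roundTrip(cyrillic, StandardCharsets.UTF_8).equals(cyrillic)); // true

        // 0xFF never appears in valid UTF-8, so it decodes to U+FFFD.
        System.out.println(decodeUtf8(new byte[] {(byte) 0xFF}).equals("\uFFFD")); // true
    }
}
```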
