Encoding issues crawling non-English websites

Views: 130 | Published: 2017/8/17 0:59:23 | Tags: java, encoding, utf-8, internationalization, web-crawler

Problem description

I'm trying to get the contents of a webpage as a string. I found this question on how to write a basic web crawler, which claims to (and seems to) handle the encoding issue. However, the code provided there works for US/English websites but fails to handle other languages properly.

Here is a full Java class that demonstrates what I'm referring to:

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class I18NScraper
{
    static
    {
        System.setProperty("http.agent", "");
    }

    public static final String IE8_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)";

    // https://stackoverflow.com/questions/1381617/simplest-way-to-correctly-load-html-from-web-page-into-a-string-in-java
    private static final Pattern CHARSET_PATTERN = Pattern.compile("text/html;\\s+charset=([^\\s]+)\\s*");
    public static String getPageContentsFromURL(String page) throws UnsupportedEncodingException, MalformedURLException, IOException {
        Reader r = null;
        try {
            URL url = new URL(page);
            HttpURLConnection con = (HttpURLConnection)url.openConnection();
            con.setRequestProperty("User-Agent", IE8_USER_AGENT);

            Matcher m = CHARSET_PATTERN.matcher(con.getContentType());
            /* If Content-Type doesn't match this pre-conception, choose default and 
             * hope for the best. */
            String charset = m.matches() ? m.group(1) : "ISO-8859-1";
            r = new InputStreamReader(con.getInputStream(),charset);
            StringBuilder buf = new StringBuilder();
            while (true) {
              int ch = r.read();
              if (ch < 0)
                break;
              buf.append((char) ch);
            }
            return buf.toString();
        } finally {
            if(r != null){
                r.close();
            }
        }
    }

    private static final Pattern TITLE_PATTERN = Pattern.compile("<title>([^<]*)</title>");
    public static String getDesc(String page){
        Matcher m = TITLE_PATTERN.matcher(page);
        if(m.find())
            return m.group(1);
        return page.contains("<title>")+"";
    }

    public static void main(String[] args) throws UnsupportedEncodingException, MalformedURLException, IOException{
        System.out.println(getDesc(getPageContentsFromURL("http://yandex.ru/yandsearch?text=%D0%A0%D0%B5%D0%B7%D1%83%D0%BB%D1%8C%D1%82%D0%B0%D1%82%D0%BE%D0%B2&lr=223")));
    }
}

Which outputs:

???????????&nbsp;&mdash; ??????: ??????? 360&nbsp;???&nbsp;???????

when it should be:

Результатов&nbsp;&mdash; Яндекс: Нашлось 360&nbsp;млн&nbsp;ответов
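
One observation about the two outputs above: the first run of question marks has 11 characters, the same length as "Результатов". Java substitutes one `?` per character when text that was decoded correctly is later encoded into a charset that cannot represent it, so it may be worth ruling out the console before blaming the download. The sketch below only illustrates that effect under this assumption; the class name `ConsoleEncodingCheck` and the helper `asSeenThrough` are hypothetical and not part of the original code.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingCheck {

    /**
     * Encodes the text with the given charset and decodes it again,
     * approximating what a console limited to that charset would display.
     * String.getBytes(Charset) replaces unmappable characters with the
     * charset's default replacement, which is '?' for US-ASCII.
     */
    public static String asSeenThrough(String text, Charset consoleCharset) {
        return new String(text.getBytes(consoleCharset), consoleCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "Результатов", written with Unicode escapes so the source file
        // encoding cannot interfere with the experiment.
        String title = "\u0420\u0435\u0437\u0443\u043B\u044C\u0442\u0430\u0442\u043E\u0432";

        // An ASCII-only output path turns each of the 11 letters into '?'.
        System.out.println(asSeenThrough(title, StandardCharsets.US_ASCII));

        // Writing UTF-8 bytes explicitly avoids the platform default charset.
        // Whether the terminal then renders them correctly depends on the
        // terminal itself, which this sketch cannot control.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(title);
    }
}
```

If the second line prints correctly in a UTF-8 capable terminal while the first shows question marks, the string in memory is likely fine and the loss happens on output.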

Can you help me understand what I'm doing wrong? Forcing UTF-8 does not help, even though that is the charset listed in both the page source and the HTTP header.
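
Separately from the symptom above, the `CHARSET_PATTERN` in the class is fairly strict: combined with `m.matches()`, it requires the header to start with `text/html;`, to have at least one whitespace character before `charset=`, and to use lowercase `charset`. A header such as `text/html;charset=UTF-8` would silently fall back to ISO-8859-1. The following is a minimal sketch of a more tolerant parse, not the approach from the linked question; `ContentTypeCharset` and `charsetOf` are hypothetical names introduced for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {

    // Finds a charset parameter anywhere in the header value, ignoring case,
    // allowing optional whitespace around '=' and optional double quotes.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the charset named in a Content-Type header value, or the
     * fallback when the header is null, has no charset parameter, or names
     * a charset this JVM does not recognize.
     */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException, which extend it.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=utf-8", fallback));
        System.out.println(charsetOf("text/html;charset=UTF-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

In the original method, the result could be passed to `new InputStreamReader(con.getInputStream(), charset)`, which also accepts a `Charset`. The null check matters because `getContentType()` can return null when the server sends no Content-Type header.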
