URLConnection cannot retrieve complete HTML


Question

I am trying to parse information from a website. However, it only works when the content is not very long. As the HTML gets larger, the loaded content is incomplete. The total length of the retrieved string is around 40000, and the number of characters retrieved differs each time (for example, 31345 the first time and 31358 the next), so I cannot retrieve the full page.

As a result, I assumed the problem could be related to the internet connection or to buffering. But I am using a BufferedReader, and as far as I know HttpURLConnection works like a stream, so there should not be any problem. I have checked almost every page related to URLConnection, but no one talks about this.

Is there anything wrong with my code? I have been working on this problem for a few days. Any advice would be very helpful. Thanks in advance.

public String getHtmlFromUrl(String url, int startReadingLine) {
    String xml = "";

    try {

        //URL url1 = new URL(url);
        URL url1 = new URL("http://support.google.com/analytics/bin/answer.py?hl=zh-Hant&answer=1009602");

        HttpURLConnection urlConn = (HttpURLConnection) url1
                .openConnection();

        urlConn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1;zh-tw; MSIE 6.0)");
        if (Integer.parseInt(Build.VERSION.SDK) < Build.VERSION_CODES.FROYO) {
            System.setProperty("http.keepAlive", "false");
        }
        urlConn.setReadTimeout(10000 /* milliseconds */);
        urlConn.setConnectTimeout(15000 /* milliseconds */);
        urlConn.setDoOutput(true);
        urlConn.setDoInput(true);
        urlConn.setRequestMethod("GET");
        urlConn.setUseCaches(false);


        InputStreamReader in = new InputStreamReader(
                urlConn.getInputStream());
        BufferedReader buffer = new BufferedReader(in, 100000);

        StringBuilder builder = new StringBuilder();
        String aux = "";



        while ((aux = buffer.readLine()) != null)
            builder.append(aux);

        xml = builder.toString();

        in.close();
        urlConn.disconnect();

    } catch (SocketTimeoutException e) {
        return "time out";
    } catch (IOException e) {
        e.printStackTrace();
    }
    // return XML
    return xml;
}

Here is an example of the XML (length 40710):

(I did not add the "..." at the end of the XML.)

<!DOCTYPE html><html lang="zh-Hant"class="streamlined streamlined-3"><head><script type="text/javascript">serverResponseTimeDelta=window.external&&window.external.pageT?window.external.pageT:-1;pageStartTime=new Date().getTime...

   ...

 ..."納米比亞", "NR": "諾魯", "NP": "尼泊爾", "NL": "荷蘭", "AN": "荷屬安地列斯", "KN": "尼維斯", "NC": "新喀里多尼亞", "NI": "尼加拉瓜", "NE": "尼日", "NG": "奈及利亞", "NU": "紐埃", "KR": "北韓", "NO": "挪威", "NZ": "紐西蘭", "OM": "阿曼", "PW": "帛琉", "PK": "巴基斯坦", "PS": "巴勒斯坦", "PA": "巴拿馬", "PG": "巴布亞新幾內亞", "PY": "巴拉圭", "PE": "秘魯", "PH"...

Another (length 41106):

<!DOCTYPE html><html lang="zh-Hant"class="streamlined streamlined-3"><head><script type="text/javascript">serverResponseTimeDelta=window.external&&window.external.pageT?window.externa...

    ...

...屬安地列斯", "KN": "尼維斯", "NC": "新喀里多尼亞", "NI": "尼加拉瓜", "NE": "尼日", "NG": "奈及利亞", "NU": "紐埃", "KR": "北韓", "NO": "挪威", "NZ": "紐西蘭", "OM": "阿曼", "PW": "帛琉", "PK": "巴基斯坦", "PS": "巴勒斯坦", "PA": "巴拿馬", "PG": "巴布亞新幾內亞", "PY": "巴拉圭", "PE": "秘魯", "PH"...

Edit: So far I assume it has something to do with the way it interacts with the internet, since the count of each result is different, or it could be some weird bug in my device. The root cause is yet to be found. The weirdest part is that the result ends with "...". It appears to know that the result is not complete yet.

Recommended answer

Always try writing your input to an external file and look at what you actually receive! I had the same problem on Android too. In the end, it turned out that logcat simply didn't show me the whole string!
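The check suggested in this answer can be sketched as below. This is a minimal illustration, not the answerer's code: the class and method names (`ResponseDump`, `readAll`, `dumpToFile`, `chunks`) are hypothetical, an in-memory stream stands in for `urlConn.getInputStream()`, and the chunk size of 4000 is an assumption based on log viewers cutting off long entries. The idea is to compare the length of the string actually received with what the log shows, and to log in smaller pieces if the full text is needed in the log.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ResponseDump {

    /** Reads every character from the stream, keeping line breaks intact. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Writes the text to a file so it can be inspected outside the log viewer. */
    static void dumpToFile(String text, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Splits the text into pieces of at most maxLen chars, one piece per log call. */
    static List<String> chunks(String text, int maxLen) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i += maxLen) {
            parts.add(text.substring(i, Math.min(text.length(), i + maxLen)));
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for urlConn.getInputStream(): a fake 50,000-character page.
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            page.append("0123456789");
        }
        InputStream in = new ByteArrayInputStream(
                page.toString().getBytes(StandardCharsets.UTF_8));

        String html = readAll(in, StandardCharsets.UTF_8);
        File out = File.createTempFile("response", ".html");
        dumpToFile(html, out);

        // Trust these numbers rather than what the log window displays.
        System.out.println("chars received: " + html.length());
        System.out.println("bytes on disk:  " + out.length() + " (" + out + ")");

        // Assumed chunk size; on Android each piece would go to one Log.d call.
        for (String part : chunks(html, 4000)) {
            System.out.println(part.length() + " chars in this log entry");
        }
    }
}
```

If the character count and the file contents are complete while the logged string is cut short, the download itself is fine and only the log display is truncated. On Android the file would typically go to app-specific storage rather than a temp file.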
