Web page source downloaded through Jsoup is not equal to the actual web page source


Problem description



I have a serious problem here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.

I have the following code:

Document doc = Jsoup.connect(url).timeout(30000).get();

Here I am using the Jsoup library, and the result I am getting is not equal to the actual page source that we can see by right-clicking on the page -> View Page Source. Many parts are missing from the result I get with the above line of code. After searching some sites on Google, I saw this method:

URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);

int numCharsRead;
char[] charArray = new char[1024];
StringBuilder sb = new StringBuilder();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
isr.close();
String result = sb.toString();

System.out.println(result);

But no luck. While I was searching the internet for this problem, I saw many sites that said I had to set the proper charset and encoding type of the web page while downloading its page source. But how will I get to know these things dynamically from my code? Are there any classes in Java for that? I also went through crawler4j a bit, but it did not do much for me. Please help, guys. I have been stuck with this problem for over a month now, and I have tried every way I can. So my final hope is the gods of Stack Overflow, who have always helped!
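Regarding the charset question raised above: the encoding is usually advertised in the HTTP `Content-Type` response header (e.g. `text/html; charset=ISO-8859-1`), which `URLConnection.getContentType()` returns, so it can be parsed before decoding the stream. A minimal standard-library sketch; the helper name and the UTF-8 fallback are my own choices, not part of any particular API:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSniff {
    // Extract the charset from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1"; fall back to UTF-8 when absent.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase().startsWith("charset=")) {
                    return Charset.forName(p.substring("charset=".length()).trim());
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1").name()); // ISO-8859-1
        System.out.println(charsetFrom("text/html").name());                     // UTF-8
    }
}
```

The returned `Charset` can then be passed to `new InputStreamReader(is, charset)` in the download loop above. Note that some pages omit the header and declare the encoding only in a `<meta>` tag, which this sketch does not cover.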

Solution

I had this recently. I'd run into some sort of robot protection. Change your original line to:

Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(30000)
                    .get();
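If you stay with the plain `URLConnection` approach from the question instead of Jsoup, the same workaround applies: send a browser-like `User-Agent` request header before connecting. A minimal sketch (the URL is only a placeholder, and the timeouts mirror the question's values):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentFetch {
    // Build a connection that sends a browser-like User-Agent,
    // mirroring the Jsoup .userAgent("Mozilla/5.0") fix above.
    static HttpURLConnection prepare(String page) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        return conn;
    }

    public static void main(String[] args) throws Exception {
        // No network traffic happens until connect()/getInputStream().
        HttpURLConnection conn = prepare("https://example.com/");
        System.out.println(conn.getRequestProperty("User-Agent")); // Mozilla/5.0
    }
}
```

Also note that neither Jsoup nor `URLConnection` executes JavaScript, so content that the page injects via scripts will still be missing from the downloaded source regardless of headers.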
