Handling connection errors and JSoup

Question

I'm trying to create an application to scrape content off of multiple pages on a site. I am using JSoup to connect. This is my code:

for (String locale : langList) {
    sitemapPath = sitemapDomain + "/" + locale + "/" + sitemapName;
    try {
        Document doc = Jsoup.connect(sitemapPath)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .get();

        Elements element = doc.select("loc");
        for (Element urls : element) {
            System.out.println(urls.text());
        }
    } catch (IOException e) {
        System.out.println(e);
    }
}

Everything works perfectly most of the time. However, there are a few things I want to be able to do.

First off, sometimes a 404 status will return, or a 500, or maybe a 301. With my code above it just prints the error and moves on to the next URL. What I would like is to report the status for every URL: if the page connects, print 200; if not, print the relevant status code.

Secondly, I sometimes catch this error: "java.net.SocketTimeoutException: Read timed out". I could increase my timeout, but I would prefer to try connecting 3 times; after the third failure I want to add the URL to a "failed" array so I can retry the failed connections in the future.

Can someone with more knowledge than me help me out?

Solution

For your first question, you can do your connection/read in two steps, stopping to ask for the status code in the middle like so:

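import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Connection.Response response = Jsoup.connect(sitemapPath)
        .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
        .timeout(10000)
        .ignoreHttpErrors(true) // without this, 4xx/5xx responses throw HttpStatusException
        .execute();

int statusCode = response.statusCode();
if (statusCode == 200) {
    // Parse the body that execute() already fetched; no second request is made
    Document doc = response.parse();
    Elements element = doc.select("loc");
    for (Element urls : element) {
        System.out.println(urls.text());
    }
} else {
    System.out.println("received error code : " + statusCode);
}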

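Note that the execute() method will fail with an IOException if it's unable to connect to the server, if the response is malformed HTTP, etc., so you'll need to handle that. By default Jsoup also throws an HttpStatusException (a subclass of IOException) for 4xx and 5xx responses; calling ignoreHttpErrors(true) on the connection lets you read those status codes instead. With that set, as long as the server said something that made sense, you'll be able to read the status code and continue. Also, if Jsoup is following redirects (which it does by default), you won't see 30x response codes, because Jsoup reports the status code of the final page fetched; call followRedirects(false) if you want to see them.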

As for your second question, all you need is a loop around the code sample I just gave you, with the body wrapped in a try/catch block that catches SocketTimeoutException. When you catch the exception, the loop should continue to the next attempt. If you're able to get data, then return or break. Shout if you need more help with it!
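The retry loop described above could be sketched roughly like this. It is only an illustration under some assumptions: `RetryDemo`, `StatusFetcher`, `fetchAll`, and `MAX_ATTEMPTS` are made-up names, and the actual HTTP call is hidden behind the hypothetical `StatusFetcher` interface so the sketch runs without a network. With Jsoup you would presumably pass something like `url -> Jsoup.connect(url).timeout(10000).ignoreHttpErrors(true).execute().statusCode()`.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RetryDemo {

    /** Hypothetical hook: fetches a URL and returns its HTTP status code. */
    @FunctionalInterface
    interface StatusFetcher {
        int fetch(String url) throws IOException;
    }

    static final int MAX_ATTEMPTS = 3;

    /**
     * Tries each URL up to MAX_ATTEMPTS times and prints its status code.
     * Returns the URLs that never produced a response, so they can be retried later.
     */
    static List<String> fetchAll(List<String> urls, StatusFetcher fetcher) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            boolean succeeded = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    int status = fetcher.fetch(url);
                    System.out.println(url + " -> " + status);
                    succeeded = true;
                    break; // got a response, stop retrying
                } catch (SocketTimeoutException e) {
                    // Timed out: fall through to the next attempt
                    System.out.println(url + " timed out (attempt " + attempt + " of " + MAX_ATTEMPTS + ")");
                } catch (IOException e) {
                    // Other I/O failures are not retried in this sketch
                    System.out.println(url + " failed: " + e);
                    break;
                }
            }
            if (!succeeded) {
                failed.add(url);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Fake fetcher standing in for a real HTTP call: URLs ending in "/slow" always time out
        StatusFetcher fake = url -> {
            if (url.endsWith("/slow")) {
                throw new SocketTimeoutException("Read timed out");
            }
            return 200;
        };
        List<String> failed = fetchAll(
                Arrays.asList("http://example.com/ok", "http://example.com/slow"), fake);
        System.out.println("failed: " + failed);
    }
}
```

SocketTimeoutException is caught before IOException because it is a subclass of it; in the other order the code would not compile. Whether non-timeout errors should also be retried is a design choice, and here they are simply recorded as failed after the first attempt.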
