Open a connection with Jsoup, get status code and parse document
Question
I'm creating a class using jsoup that will do the following:
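- The constructor opens a connection to the URL.
- I have a method that checks the status of the page, i.e. 200, 404, etc.
- I have a method that parses the page and returns a list of URLs.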
Below is a rough sketch of what I am trying to do. Note that it is very rough, as I've been trying a lot of different things:
public class ParsePage {
    private String path;
    Connection.Response response = null;

    private ParsePage(String langLocale) {
        try {
            response = Jsoup.connect(path)
                    .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                    .timeout(10000)
                    .execute();
        } catch (IOException e) {
            System.out.println("io - " + e);
        }
    }

    public int getSitemapStatus() {
        int statusCode = response.statusCode();
        return statusCode;
    }

    public ArrayList<String> getUrls() {
        ArrayList<String> urls = new ArrayList<String>();
    }
}
As you can see, I can get the page status, but I don't know how to get the document to parse using the connection already opened in the constructor. I tried using:
Document doc = connection.get();
But that doesn't work. Any suggestions, or better ways to go about this?
Recommended answer
As stated in the jsoup documentation for the Connection.Response type, there is a parse() method that parses the response's body as a Document and returns it. Once you have that Document, you can do whatever you want with it.
For example, take a look at this implementation of getUrls():
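import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParsePage {
    // Note: this snippet never assigns path; it must be set before connecting.
    private String path;
    Connection.Response response = null;

    private ParsePage(String langLocale) {
        try {
            // execute() performs the request and keeps the whole response
            response = Jsoup.connect(path)
                    .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                    .timeout(10000)
                    .execute();
        } catch (IOException e) {
            System.out.println("io - " + e);
        }
    }

    public int getSitemapStatus() {
        return response.statusCode();
    }

    // parse() is declared to throw IOException, so it is propagated here
    public ArrayList<String> getUrls() throws IOException {
        ArrayList<String> urls = new ArrayList<String>();
        Document doc = response.parse();
        // do whatever you want, for example retrieving the <url> from the sitemap
        for (Element url : doc.select("url")) {
            urls.add(url.select("loc").text());
        }
        return urls;
    }
}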
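
One caveat about the status check: by default, jsoup's execute() throws an HttpStatusException (a subclass of IOException) for error responses such as 404, so no response would be stored and the status method would have nothing to report. The following is a minimal sketch of one way to handle that, not the answer author's code. The class name SitemapPage, the helper extractLocs, and the choice to let the constructor throw IOException are assumptions made for illustration; ignoreHttpErrors(true) is the jsoup setting that returns error responses instead of throwing.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical class name; a sketch of "execute once, then read status and parse".
class SitemapPage {
    private final Connection.Response response;

    // Executes the request a single time and keeps the response for later use.
    SitemapPage(String url) throws IOException {
        this.response = Jsoup.connect(url)
                .timeout(10000)
                // Assumption: 4xx/5xx should be reported as a status code
                // rather than thrown as HttpStatusException.
                .ignoreHttpErrors(true)
                .execute();
    }

    // Status code of the stored response, e.g. 200 or 404.
    int getStatus() {
        return response.statusCode();
    }

    // Parses the stored response body; no second request is made.
    List<String> getUrls() throws IOException {
        return extractLocs(response.parse());
    }

    // Hypothetical helper: collects the text of each <loc> inside a <url> element.
    static List<String> extractLocs(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element url : doc.select("url")) {
            urls.add(url.select("loc").text());
        }
        return urls;
    }
}
```

A caller could then construct new SitemapPage("https://example.com/sitemap.xml") (a placeholder URL), check that getStatus() returns 200, and only then call getUrls().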