How do I parse an HTML document with JSoup to get a list of links?

Question

I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links to load a program dynamically with this information. So far I have done this:

Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries

Below this tag are the elements I am trying to get, which I select with doc.select("div.state_delimiter,ul"). I set up my iterator, go into a while loop and call iterator.next().outerHtml();. I see all the tags for each country.

How can I step through each div.state_delimiter, pull its text, and then move down until there is a </ul>, which marks the end of that state's individual county/city links and text?

I have been playing around with this and can do it by assigning outerHtml() to a String and then parsing that string manually, but I am sure there is an easier way. I have tried text() and also attr("div.state_delimiter"), but I think I am getting the pattern wrong. Could someone show me how to get the text of each div.state_delimiter and then all the <li></li> elements under the <ul></ul> for each state? I am looking to grab the http:// link and the HTML that goes along with it as easily as possible.

Recommended answer

The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to grab it from that div on. Here's a kickoff example:

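import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Run this inside a method that handles IOException, which get() can throw.
Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");

for (Element country : countries) {
    System.out.println("Country: " + country.select("h1.continent_header").text());
    Elements states = country.select("div.state_delimiter");

    for (Element state : states) {
        System.out.println("\tState: " + state.text());
        // The <ul> with the cities is the element right after the state's div.
        Elements cities = state.nextElementSibling().select("li");

        for (Element city : cities) {
            System.out.println("\t\tCity: " + city.text());
        }
    }
}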
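
The loop above prints only the text of each city. Since the question also asks for the URLs, here is a minimal sketch of one way to collect both. It assumes that each city <li> wraps an <a href="..."> element, which the thread does not show. The class name StateLinks, the method extract, the sample markup and the example.org base URI are all made up for illustration, and the HTML is parsed from a string so the sketch runs without a network connection.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class StateLinks {

    // Hypothetical helper: maps each state name to its cities
    // (link text -> absolute URL), keeping document order.
    static Map<String, Map<String, String>> extract(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Map<String, Map<String, String>> result = new LinkedHashMap<>();

        for (Element state : document.select("div.state_delimiter")) {
            Map<String, String> cities = new LinkedHashMap<>();
            Element list = state.nextElementSibling();

            // Guard against a delimiter that is not followed by a <ul>.
            if (list != null && "ul".equals(list.tagName())) {
                for (Element link : list.select("li > a[href]")) {
                    // "abs:href" resolves a relative href against baseUri.
                    cities.put(link.text(), link.attr("abs:href"));
                }
            }
            result.put(state.text(), cities);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup that mirrors the structure described in the question.
        String html = "<div class=\"colmask\">"
            + "<div class=\"state_delimiter\">Alabama</div>"
            + "<ul><li><a href=\"/auburn\">auburn</a></li>"
            + "<li><a href=\"http://bham.example.org/\">birmingham</a></li></ul>"
            + "</div>";

        extract(html, "http://www.example.org/").forEach((state, cities) -> {
            System.out.println(state);
            cities.forEach((name, url) -> System.out.println("\t" + name + " -> " + url));
        });
    }
}
```

When the document comes from Jsoup.connect(...).get() instead, the base URI is set from the request URL, so attr("abs:href") should resolve relative links without passing one explicitly.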

The doc.select("div.state_delimiter,ul") call doesn't do what you want. It returns all <div class="state_delimiter"> and all <ul> elements of the document. Parsing it manually with string functions makes no sense when you already have an HTML parser at hand.
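
To make the difference concrete, the sketch below compares the comma (group) selector with the adjacent sibling combinator "+", which jsoup's selector syntax also supports and which matches only a <ul> that directly follows a state delimiter. The class name SelectorComparison, the helper count and the sample markup are made up for illustration. This is an alternative to nextElementSibling(), not something the answer itself proposes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorComparison {

    // Hypothetical helper: number of elements matched by a CSS query.
    static int count(Document document, String cssQuery) {
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Made-up markup: one unrelated list plus two state/list pairs.
        String html = "<ul id=\"nav\"><li>unrelated menu</li></ul>"
            + "<div class=\"state_delimiter\">Alabama</div>"
            + "<ul><li>auburn</li></ul>"
            + "<div class=\"state_delimiter\">Alaska</div>"
            + "<ul><li>anchorage</li></ul>";
        Document document = Jsoup.parse(html);

        // Group selector: both delimiters plus every <ul>, related or not.
        System.out.println(count(document, "div.state_delimiter, ul"));   // prints 5

        // Adjacent sibling combinator: only the lists right after a delimiter.
        System.out.println(count(document, "div.state_delimiter + ul"));  // prints 2
    }
}
```

The combinator gives you the lists but not the state name that goes with each one, so iterating over the delimiters and calling nextElementSibling() is still the simpler choice when you need both.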
