Scraping XML with JSoup

Question

I'm trying to scrape the feed here.

At the moment I'm just trying to wrap my head around JSoup, so the following code is merely proof of concept (or an attempt at it, at least).

    public static void grabShakers(String url) throws IOException {
        // doc, desc, links and price are not declared here; presumably they are static fields
        doc = Jsoup.connect(url).get();

        desc = doc.select("title");
        links = doc.select("link");
        price = doc.select("span.price");
    }

It grabs the title of each item perfectly. The output of each link is simply ten repeated closing link tags and it never finds any prices. I thought perhaps the CDATA was the issue, so I converted doc to html, stripped out the comments using .replace, and then converted it back to a Document for parsing to no avail. Any insight would be greatly appreciated.

The following code is what I'm using to print out each element:

    for (Element src : price) {
        System.out.println(src);
    }

Recommended answer

There are two problems with the feed:

  1. As parsed, the document contains only <link />..actual link.. instead of a full link tag. jsoup's default HTML parser treats link as an empty (void) element, so the URL ends up as a text node after the tag.
  2. The description (containing the price tag) is escaped HTML, so it is treated as plain text and span.price never matches.
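Both problems can be reproduced on a small inline sample. The sketch below assumes jsoup is on the classpath; the class name, method names and sample string are hypothetical and only shaped like the feed. It also shows that jsoup's own XML parser (Parser.xmlParser()) keeps the URL inside the link element, which is an alternative to the workaround in this answer rather than what the answer itself does.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class ParserComparison {
    // Hypothetical one-item sample shaped like the feed; not real feed data.
    static final String SAMPLE =
        "<item><title>#1: Sample item</title>"
      + "<link>http://example.com/item1</link>"
      + "<description>&lt;span class=\"price\"&gt;$1.23&lt;/span&gt;</description></item>";

    // Problem 1: the default HTML parser treats <link> as an empty element,
    // so the element itself carries no text.
    static String linkViaHtmlParser(String xml) {
        Document doc = Jsoup.parse(xml);
        return doc.select("link").first().text();
    }

    // With the XML parser, the URL stays inside the <link> element.
    static String linkViaXmlParser(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("link").first().text();
    }

    // Problem 2: the escaped markup in <description> is plain text,
    // so no span.price element exists to select.
    static int priceSpansFound(String xml) {
        return Jsoup.parse(xml).select("span.price").size();
    }

    public static void main(String[] args) {
        System.out.println("HTML parser link: '" + linkViaHtmlParser(SAMPLE) + "'");
        System.out.println("XML parser link:  '" + linkViaXmlParser(SAMPLE) + "'");
        System.out.println("span.price matches: " + priceSpansFound(SAMPLE));
    }
}
```

For a live feed, the same parser can be requested with Jsoup.connect(url).parser(Parser.xmlParser()).get().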

Solution:

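    // Needs: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
    // and org.apache.commons.lang3.StringEscapeUtils (Apache Commons Lang 3)
    final String url = "http://www.amazon.com/gp/rss/movers-and-shakers/appliances/ref=zg_bsms_appliances_rsslink";
    Document doc = Jsoup.connect(url).get();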


    for( Element item : doc.select("item") ) // Select all items
    {
        final String title = item.select("title").first().text(); // select the 'title' of the item
        final String link = item.select("link").first().nextSibling().toString().trim(); // select 'link' (-1-)

        final Document descr = Jsoup.parse(StringEscapeUtils.unescapeHtml4(item.select("description").first().toString()));
        final String price = descr.select("span.price").first().text(); // select 'price' (-2-)

        // Output - Example
        System.out.println(title);
        System.out.println(link);
        System.out.println(price);
        System.out.println();
    }

Note 1: Workaround for the link: select the (empty) link tag and take its next sibling node, which is the TextNode holding the actual link.

Note 2: Workaround for the price: select the description tag, unescape the HTML, parse it, and select the price. For unescaping I used StringEscapeUtils.unescapeHtml4() from Apache Commons Lang.
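As a cross-check that needs neither jsoup nor Commons Lang, the same three fields can be read with the JDK's built-in XML parser. This is a swapped-in technique, not the approach used in this answer: a real XML parser keeps the URL inside link and decodes the escaped description on its own. The class name, the sample string and the regular expression for the price span are assumptions made for illustration; a proper HTML parser is more robust than the regex.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FeedSketch {
    // Hypothetical one-item sample shaped like the feed; not real feed data.
    static final String SAMPLE =
        "<rss><channel>"
      + "<item><title>#1: Sample item</title>"
      + "<link>http://example.com/item1</link>"
      + "<description>&lt;span class=\"price\"&gt;$1.23&lt;/span&gt;</description></item>"
      + "</channel></rss>";

    // Crude pattern for the price span (an assumption about the markup).
    private static final Pattern PRICE =
        Pattern.compile("<span class=\"price\">([^<]*)</span>");

    // Returns one {title, link, price} row per <item>.
    static List<String[]> parseItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String title = childText(item, "title");
                String link = childText(item, "link");          // URL is inside <link>
                String descr = childText(item, "description");  // entities already decoded
                Matcher m = PRICE.matcher(descr);
                String price = m.find() ? m.group(1) : "";
                rows.add(new String[] { title, link, price });
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    // Text of the first descendant with the given tag name, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        for (String[] row : parseItems(SAMPLE)) {
            System.out.println(row[0]);
            System.out.println(row[1]);
            System.out.println(row[2]);
        }
    }
}
```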

Output (using the link from above):

#1: Epicurean Gourmet Series 20-Inch-by-15-Inch Cutting Board with Cascade Effect, Nutmeg with Natural Core
http://www.amazon.com/Epicurean-Gourmet-20-Inch-15-Inch-Cutting/dp/B003MU9PLU/ref=pd_zg_rss_ms_la_appliances_1
$72.95

#2: GE 45600 Z-Wave Basic Handheld Remote
http://www.amazon.com/GE-45600-Z-Wave-Handheld-Remote/dp/B0013V6RW0/ref=pd_zg_rss_ms_la_appliances_2
$3.00

#3: First Alert RD1 Radon Gas Test Kit
http://www.amazon.com/First-Alert-RD1-Radon-Test/dp/B00002N83E/ref=pd_zg_rss_ms_la_appliances_3
$10.60

#4: Presto 04820 PopLite Hot Air Popper, White
http://www.amazon.com/Presto-04820-PopLite-Popper-White/dp/B00006IUWA/ref=pd_zg_rss_ms_la_appliances_4
$9.99

#5: New 20 oz Espresso Coffee Milk Frothing Pitcher, Stainless Steel, 18/8 gauge
http://www.amazon.com/Espresso-Coffee-Frothing-Pitcher-Stainless/dp/B000FNK3Z4/ref=pd_zg_rss_ms_la_appliances_5
$8.19

#6: PUR 18 Cup Dispenser with One Pitcher Filter DS-1800Z
http://www.amazon.com/PUR-Dispenser-Pitcher-Filter-DS-1800Z/dp/B0006MQCA4/ref=pd_zg_rss_ms_la_appliances_6
$22.17

#7: Hamilton Beach 70610 500-Watt Food Processor, White
http://www.amazon.com/Hamilton-Beach-70610-500-Watt-Processor/dp/B000SAOF5S/ref=pd_zg_rss_ms_la_appliances_7
$21.95

#8: West Bend 77203 Electric Can Opener, Metallic
http://www.amazon.com/West-Bend-77203-Electric-Metallic/dp/B00030J1U2/ref=pd_zg_rss_ms_la_appliances_8
$35.79

#9: Custom Leathercraft 2077L Black Ski Glove, Large
http://www.amazon.com/Custom-Leathercraft-2077L-Black-Glove/dp/B00499BS9A/ref=pd_zg_rss_ms_la_appliances_9
$8.83

#10: Cuisinart CPC-600 1000-Watt 6-Quart Electric Pressure Cooker, Brushed Stainless and Matte Black
http://www.amazon.com/Cuisinart-CPC-600-1000-Watt-Electric-Stainless/dp/B000MPA044/ref=pd_zg_rss_ms_la_appliances_10
$64.95
