Issue parsing HTML with jsoup


Question

I am trying to parse this HTML using jsoup.

My code is:

    doc = Jsoup.connect(htmlUrl).timeout(1000 * 1000).get();

    Elements items = doc.select("item");
    Log.d(TAG, "Items size : " + items.size());
    for (Element item : items) {
        Log.d(TAG, "in for loop of items");

        Element titleElement = item.select("title").first();
        mTitle = titleElement.text().toString();
        Log.d(TAG, "title is : " + mTitle);

        Element linkElement = item.select("link").first();
        mLink = linkElement.text().toString();
        Log.d(TAG, "link is : " + mLink);

        Element descElement = item.select("description").first();
        mDesc = descElement.text().toString();
        Log.d(TAG, "description is : " + mDesc);
    }

I am getting the following output:

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : 
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.<div class="feedflare"> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.reuters.com/~ff/reuters/audio/newsmakerus/rss/mp3?a=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/reuters/audio/newsmakerus/rss/mp3?i=NX3AY96GfGk:hAtGeOq2ESs:F7zBnMyn0Lo" border="0"></img></a> </div><img src="http://feeds.feedburner.com/~r/reuters/audio/newsmakerus/rss/mp3/~4/NX3AY96GfGk" height="1" width="1"/>

But I want this output:

in for loop of items
D/HtmlParser( 6690): title is : Indonesian president: Some multinationals "take too much"
D/HtmlParser( 6690): link is : http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/KDcQe4gF-3U/62828262.mp3  
D/HtmlParser( 6690): description is : April 23 - Indonesian President Susilo Bambang Yudhoyono tells a Thomson Reuters Newsmaker event that the country welcomes foreign investment in its resources sector, but must receive a "fair share" of benefits.

What should I change in my code to achieve this? Please help me!

Thanks in advance!

Recommended answer

There are two problems in the RSS content you fetched.

  1. The link text is not inside the <link/> tag but outside of it, so selecting "link" returns an empty element.
  2. The description tag contains escaped HTML content.
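Both problems can be reproduced without touching the network. The sketch below is only an illustration: the class name RssSymptomDemo, its helper methods, and the shortened SAMPLE item with its example.com URLs are made up for this example and are not the real Reuters feed. It parses the sample with jsoup's default HTML parser, which treats <link> as an empty element, and reads the same three fields the question reads.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class RssSymptomDemo {
    // Hypothetical, shortened feed item used only for illustration.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio.mp3</link>"
        + "<description>Summary text.&lt;div class=\"feedflare\"&gt;"
        + "&lt;img src=\"http://example.com/p.gif\"/&gt;&lt;/div&gt;</description>"
        + "</item></channel></rss>";

    // Parse with the default HTML parser, as Jsoup.connect(...).get() does.
    static Element firstItem() {
        return Jsoup.parse(SAMPLE).select("item").first();
    }

    // Text inside the <link> element itself (problem 1: nothing is there).
    static String linkTagText() {
        return firstItem().select("link").first().text();
    }

    // Text that belongs directly to <item>, where the URL actually ends up.
    static String itemOwnText() {
        return firstItem().ownText();
    }

    // Description text (problem 2: it still contains HTML markup).
    static String descriptionText() {
        return firstItem().select("description").first().text();
    }

    public static void main(String[] args) {
        System.out.println("link element text: [" + linkTagText() + "]");
        System.out.println("item own text:     [" + itemOwnText() + "]");
        System.out.println("description text:  " + descriptionText());
    }
}
```

The empty link text and the markup left in the description correspond to the "link is :" and "description is :" lines in the output shown in the question.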

Please find the modified code below.

Also, I found cleaner HTML content when I viewed the URL in a browser; parsing that content would make it easier to extract the desired fields. You can get it by setting a browser user agent on the jsoup connection. But it's up to you to decide how to fetch the content.
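A minimal sketch of the user-agent suggestion, under stated assumptions: the class name BrowserUserAgentDemo, the helper browserLikeConnection, the 10-second timeout, and the BROWSER_UA string are illustrative choices, not values from the original answer. Connection.userAgent(...) is the jsoup call that sets the User-Agent request header. Whether the server still returns the cleaner HTML for this header is not something this sketch can guarantee.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserUserAgentDemo {
    // Assumption: an arbitrary desktop-browser user agent string.
    static final String BROWSER_UA =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Builds the connection only; no request is sent until get() is called.
    static Connection browserLikeConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_UA)
                .timeout(10_000); // milliseconds, an illustrative value
    }

    public static void main(String[] args) throws IOException {
        // Feed URL taken from the answer; it may no longer respond.
        String url = "http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/";
        Document doc = browserLikeConnection(url).get();
        System.out.println(doc.html());
    }
}
```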

    // Requires org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
    // org.jsoup.parser.Parser and org.jsoup.select.Elements.
    // timeout(0) disables the timeout.
    Document doc = Jsoup.connect("http://feeds.reuters.com/reuters/audio/newsmakerus/rss/mp3/").timeout(0).get();
    System.out.println(doc.html());
    System.out.println("================================");
    Elements items = doc.select("item");
    for (Element item : items) {

        Element titleElement = item.select("title").first();
        String mTitle = titleElement.text();
        System.out.println("title is : " + mTitle);

        /*
         * The link in the parsed RSS looks like this:
         *  <link />http://feeds.reuters.com/~r/reuters/audio/newsmakerus/rss/mp3/~3/NX3AY96GfGk/59621707.mp3
         * The URL is not inside the <link> element; it is a text node that
         * belongs directly to <item>, so read the item's own text instead.
         */
        String mLink = item.ownText();
        System.out.println("link is : " + mLink);

        Element descElement = item.select("description").first();
        /*
         * Unescape the HTML content, parse it into a document, and then take
         * only the text, leaving behind all the HTML tags in the content.
         * "/" is a dummy base URI; we don't care about resolving the links
         * within the parsed content.
         */
        String mDesc = Parser.parse(Parser.unescapeEntities(descElement.text(), false), "/").text();
        System.out.println("description is : " + mDesc);

    }
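As an alternative to reading the item's own text, the feed can be parsed with jsoup's XML parser, which does not apply the HTML rule that makes <link> an empty element. This is a swapped-in technique, not the method from the answer above. The class name RssXmlParserDemo, the helper titleLinkDescription, and the SAMPLE item with its example.com URLs are hypothetical and only stand in for the real feed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RssXmlParserDemo {
    // Hypothetical, shortened feed item used only for illustration.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>Sample title</title>"
        + "<link>http://example.com/audio.mp3</link>"
        + "<description>Summary text.&lt;div class=\"feedflare\"&gt;"
        + "&lt;img src=\"http://example.com/p.gif\"/&gt;&lt;/div&gt;</description>"
        + "</item></channel></rss>";

    // Returns {title, link, description} for the first <item> in the feed.
    static String[] titleLinkDescription(String rssXml) {
        // The XML parser keeps the URL inside the <link> element.
        Document doc = Jsoup.parse(rssXml, "", Parser.xmlParser());
        Element item = doc.select("item").first();

        String title = item.select("title").first().text();
        String link = item.select("link").first().text();

        // The description text is HTML markup; parse it once more as HTML
        // and keep only the text, which drops the tags.
        String rawDescription = item.select("description").first().text();
        String description = Jsoup.parse(rawDescription).text();

        return new String[] {title, link, description};
    }

    public static void main(String[] args) {
        String[] fields = titleLinkDescription(SAMPLE);
        System.out.println("title is : " + fields[0]);
        System.out.println("link is : " + fields[1]);
        System.out.println("description is : " + fields[2]);
    }
}
```

When fetching over the network, the same parser can be selected on the connection with Jsoup.connect(url).parser(Parser.xmlParser()).get().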
