Parsing HTML content using Jsoup

Problem description

This is my HTML source

    <li>
        <a href="/info/some1">Item 1<br>
            <span class="deets">111</span>
        </a>
    </li>

    <li>
        <a href="/info/some2">Item 2<br>
            <span class="deets">222</span>
        </a>
    </li>

    <li>
        <a href="/info/some3">Item 3<br>
            <span class="deets">333</span>
        </a>
    </li>

This is my Java program, which fetches the page and strips the HTML tags with a regex

    // Look up the view before the try block so the catch block can use it too
    TextView tv = (TextView) findViewById(R.id.textView1);
    try {
        URL myurl = new URL("http://www.somewebsite.com");
        HttpURLConnection con = (HttpURLConnection) myurl.openConnection();

        InputStream result = con.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(result));
        StringBuilder sb = new StringBuilder();

        // Append every line, separated by the platform line separator
        for (String line; (line = reader.readLine()) != null;) {
            sb.append(line).append(System.getProperty("line.separator"));
        }
        // Remove anything that looks like an HTML tag
        String final_result = sb.toString().replaceAll("<.*?>", "");

        tv.setText(final_result);
    } catch (Exception e) {
        e.printStackTrace();
        tv.setText("not working");
    }

  1. Is there an easier way to parse the HTML content in Java, using Jsoup instead of a regex?

  2. Is there a way to get only the required content? Here I just want "Item 2 - 222" from this fragment:

    <li>
        <a href="/info/some2">Item 2<br>
            <span class="deets">222</span>
        </a>
    </li>
    

Solution

Try this for easy parsing using jsoup:

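    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Parse an HTML page fetched from a URL
    Document doc = Jsoup.connect("http://www.website.com").get();
    // Or parse HTML that is already held in a String
    Document doc1 = Jsoup.parse("<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p></body></html>");

    // Text of the body with all tags removed
    String content = doc.body().text();

    // Select specific elements, such as links
    Elements links = doc.select("a[href]");
    for (Element e : links) {
        System.out.println("link: " + e.attr("abs:href"));
    }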
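
For the second question, which the snippet above does not cover, one possible approach is to select the links and combine each link's own text with the text of its `span.deets` child. The sketch below is a minimal illustration and not part of the original answer: the class name `ItemExtractor` and the helper `extractItem` are hypothetical, and the HTML is the sample from the question with the `href` quotes closed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ItemExtractor {

    // Hypothetical helper: returns "<link text> - <deets text>" for the first
    // link whose href attribute equals the given value, or null if none matches.
    static String extractItem(String html, String href) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            if (!href.equals(link.attr("href"))) {
                continue;
            }
            // ownText() covers only the link's direct text ("Item 2"),
            // not the text of the nested <span>.
            String label = link.ownText().trim();
            Element deets = link.select("span.deets").first();
            return deets == null ? label : label + " - " + deets.text();
        }
        return null;
    }

    public static void main(String[] args) {
        String html =
                "<li><a href=\"/info/some1\">Item 1<br><span class=\"deets\">111</span></a></li>"
              + "<li><a href=\"/info/some2\">Item 2<br><span class=\"deets\">222</span></a></li>"
              + "<li><a href=\"/info/some3\">Item 3<br><span class=\"deets\">333</span></a></li>";

        System.out.println(extractItem(html, "/info/some2")); // prints "Item 2 - 222"
    }
}
```

Comparing the `href` attribute in Java code, rather than building a selector string from it, avoids having to escape the value. In the Android program from the question, the returned string could be passed to `tv.setText(...)` in place of the regex result.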

To learn more, visit Jsoup Docs
