JSoup parsing HTML

Problem description

I am trying to parse a non-well-formed HTML file with a DTD, which I retrieve as an InputStream, using Jsoup, and get all the data in the TD fields. How can I do that with Jsoup? I have already looked at http://jsoup.org/cookbook/ but I need some examples to get started.

Thank you in advance.

I already tried the SAX parser, but I can't get the DTD to work.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl" lang="nl"> 
<TABLE class=personaltable cellSpacing=0 cellPadding=0> 
 <TBODY> 
  <TR class=alternativerow> 
   <TD>Nieuw beltegoed:</TD> 
   <TD>€ 1,00</TD></TR> 
  <TR> 
   <TD>Tegoed vorige periode:  
   <TD>€ 2,00</TD></TD></TR> 
  <TR class=alternativerow> 
   <TD>Tegoed tot 09-11-2011:  
   <TD>€ 10,00</TD></TD></TR> 
  <TR> 
   <TD> 
   <TD height=25></TD> 
  <TR class=alternativerow> 
   <TD>Verbruik sinds nieuw tegoed:</TD> 
   <TD>€ 0,33</TD></TR> 
  <TR> 
   <TD>Ongebruikt tegoed:</TD> 
   <TD>€ 12,00</TD></TR> 
  <TR class=alternativerow> 
   <TD class=f-Orange>Verbruik boven bundel:</TD> 
   <TD class=f-Orange>€ 0,00</TD></TR> 
  <TR> 
   <TD>Verbruik dat niet in de bundel zit*:</TD> 
   <TD>€ 0,00</TD></TR> 
  </TBODY> 
 </TABLE> 
</html> 

Edit: I am getting a force close; I need Jsoup in my AsyncTask. Here is the Logcat output:

10-20 21:07:36.679: ERROR/AndroidRuntime(1396): FATAL EXCEPTION: main
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): java.lang.NullPointerException
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at com.sencide.AndroidLogin$MyTask.onPostExecute(AndroidLogin.java:276)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at com.sencide.AndroidLogin$MyTask.onPostExecute(AndroidLogin.java:1)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at android.os.AsyncTask.finish(AsyncTask.java:417)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at android.os.AsyncTask.access$300(AsyncTask.java:127)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at android.os.AsyncTask$InternalHandler.handleMessage(AsyncTask.java:429)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at android.os.Handler.dispatchMessage(Handler.java:99)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at android.os.Looper.loop(Looper.java:130)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at android.app.ActivityThread.main(ActivityThread.java:3835)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at java.lang.reflect.Method.invokeNative(Native Method)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at java.lang.reflect.Method.invoke(Method.java:507)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:847)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:605)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396):     at dalvik.system.NativeStart.main(Native Method)

Here is the AsyncTask code:

public class MyTask extends AsyncTask<String, Integer, String> {
    private Elements tdsFromSecondColumn = null;

    protected String doInBackground(String... params) {
        InputStream inputStreamActivity = response.getEntity().getContent();

        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStreamActivity));
        StringBuilder sb = new StringBuilder();
        String line = null;

        while ((line = reader.readLine()) != null) {
            sb.append(line + "\n");
        }

        /******* CLOSE CONNECTION AND STREAM *******/

        System.out.println(sb);
        inputStreamActivity.close();

        String kpn;
        kpn = sb.toString();

        Document doc = Jsoup.parse(kpn);
        Elements tdsFromSecondColumn = doc.select("table.personaltable td:eq(1)");
    }

    @Override
    protected void onPostExecute(String result) {
        //publishProgress(false);
        TextView tv = (TextView)findViewById(R.id.lbl_top);

        for (Element tdFromSecondColumn : tdsFromSecondColumn) {
            //System.out.println(tdFromSecondColumn.text());
            tv.setText("");
            tv.setText(tdFromSecondColumn.text());
        }
    }
}

Solution

So, you have an InputStream and not a URL? Then you should use the Jsoup#parse() method that takes an InputStream:

Document document = Jsoup.parse(inputStream, charsetName, baseUri);
// ...

The charsetName should be the charset the document was originally encoded in. You can leave it null to let Jsoup detect it, falling back to UTF-8. The baseUri should be the URL from which the HTML was originally served. If you don't know it, you can leave it null as well (or pass an empty string if your Jsoup version rejects null); you just won't be able to resolve relative links.
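To make the role of the base URI more concrete, here is a small sketch of what "resolving relative links" means. It uses only java.net.URI from the JDK to illustrate the idea and is not Jsoup's own implementation; the class name BaseUriDemo and the example.com URLs are made up for illustration. In Jsoup itself you would typically get the resolved form through Element#absUrl.

```java
import java.net.URI;

public class BaseUriDemo {
    // Resolve a possibly relative href against the URL the page was served from.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical URL of the page that contained the links.
        String baseUri = "http://example.com/account/overview.html";

        // A relative link is resolved against the directory of the base URI.
        System.out.println(resolve(baseUri, "details.html"));
        // A root-relative link keeps only the scheme and host of the base URI.
        System.out.println(resolve(baseUri, "/help/index.html"));
        // An absolute link is returned unchanged.
        System.out.println(resolve(baseUri, "http://other.example.org/x"));
    }
}
```

Without a base URI there is nothing to resolve "details.html" against, which is why relative links stay unresolved in that case.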

But if you actually have the original URL, then you could also just use Jsoup#connect():

Document document = Jsoup.connect(url).get();
// ...

Regardless of the way you obtained the Document, you can use CSS selectors to select elements of interest in the document. See also the Jsoup cookbook on that subject. Here's an example which extracts all the data from the 2nd column of the <table> with a class name of personaltable:

Elements tdsFromSecondColumn = document.select("table.personaltable td:eq(1)");

for (Element tdFromSecondColumn : tdsFromSecondColumn) {
    System.out.println(tdFromSecondColumn.text());
}

which results in:

€ 1,00
€ 2,00
€ 10,00

€ 0,33
€ 12,00
€ 0,00
€ 0,00
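
As for the NullPointerException in the edit: judging only from the posted code, one likely cause is that doInBackground() declares a new local variable (Elements tdsFromSecondColumn = ...) which shadows the field of the same name, so the field is still null when onPostExecute() iterates over it. The sketch below reproduces that pattern in plain Java without Android or Jsoup; the class names are hypothetical and a List<String> stands in for Elements.

```java
import java.util.Arrays;
import java.util.List;

public class ShadowingDemo {
    // Hypothetical stand-in for the posted MyTask; List<String> replaces Elements.
    static class BrokenTask {
        private List<String> cells = null;

        void doInBackground() {
            // Declares a NEW local variable; the field above is never assigned.
            List<String> cells = Arrays.asList("1,00", "2,00");
        }

        int onPostExecute() {
            return cells.size(); // the field is still null here
        }
    }

    static class FixedTask {
        private List<String> cells = null;

        void doInBackground() {
            // No type in front of the name, so this assigns the field.
            cells = Arrays.asList("1,00", "2,00");
        }

        int onPostExecute() {
            return cells.size();
        }
    }

    public static void main(String[] args) {
        FixedTask fixed = new FixedTask();
        fixed.doInBackground();
        System.out.println(fixed.onPostExecute());

        BrokenTask broken = new BrokenTask();
        broken.doInBackground();
        try {
            broken.onPostExecute();
        } catch (NullPointerException e) {
            System.out.println("field was never assigned");
        }
    }
}
```

If this is indeed the cause, dropping the Elements type in front of the assignment inside doInBackground() should be enough. Note also that the loop in onPostExecute() overwrites the TextView on every iteration, so only the last value would remain visible.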
