Parsing HTML in Java for an Android app

Question

I'm writing an Android app that takes relevant data from a website and presents it to the user (HTML scraping). The application downloads the page source and parses it, looking for relevant data to store in objects. I built a parser using JSoup, but it turned out to be really slow in my app. Also, these libraries tend to be rather large, and I want my app to be lightweight.

The webpages I'm trying to parse all have a similar structure, and I know exactly which tags I'm looking for. So I figured I might as well download the source and read it line by line, looking for the relevant data with String methods such as String.equals. For example, if the HTML looked like this:

<textTag class="text">I want this text</textTag>

I would parse it using methods like:

private void interpretHtml(String s){
    if(s.startsWith("<textTag class=\"text\"")){
        // 22 = length of the opening tag, 10 = length of "</textTag>"
        String text = s.substring(22, s.length() - 10);
    }
}
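
The fixed offsets 22 and 10 above only work while the tag is written exactly as shown and fills the whole line. A slightly more tolerant sketch, assuming the same made-up textTag markup, finds the end of the opening tag and the start of the closing tag with indexOf. The class and method names are illustrative, not part of the original question.

```java
final class TagTextExtractor {
    // Assumed markup, taken from the example line in the question.
    private static final String OPEN_PREFIX = "<textTag class=\"text\"";
    private static final String CLOSE_TAG = "</textTag>";

    /** Returns the text between the tags, or null if the line does not match. */
    static String extractText(String line) {
        if (line == null) {
            return null;
        }
        String trimmed = line.trim(); // tolerate indentation in the page source
        if (!trimmed.startsWith(OPEN_PREFIX)) {
            return null;
        }
        // End of the opening tag, searched from the end of the known prefix.
        int start = trimmed.indexOf('>', OPEN_PREFIX.length());
        if (start < 0) {
            return null;
        }
        // Start of the closing tag, searched after the opening tag.
        int end = trimmed.indexOf(CLOSE_TAG, start + 1);
        if (end < 0) {
            return null;
        }
        return trimmed.substring(start + 1, end);
    }

    public static void main(String[] args) {
        System.out.println(extractText("<textTag class=\"text\">I want this text</textTag>"));
    }
}
```

This still assumes the opening and closing tags sit on the same line; markup that spans several lines would need a different approach.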

However, I have very little knowledge about setting up connections (I've seen people use HttpGet, but I'm not entirely sure how to get data from it). I've spent quite some time searching for information on how to parse like this, but most people resort to libraries like JSoup, SAX, etc. to do the parsing.

Does anyone have some information on how to do parsing like this, maybe an example? Or is it a bad idea to parse page source this way? Please give me your opinion.

Thanks for your time.

Recommended answer

Here is how I would do it:

    // Needs java.io.*, java.net.HttpURLConnection, java.net.URL and android.util.Log.
    // The original answer omitted the method signature; fetchText is a placeholder name.
    // On Android, call this from a background thread, not the UI thread.
    private String fetchText() {
        StringBuilder text = new StringBuilder();
        HttpURLConnection conn = null;
        BufferedReader buff = null;
        try {
            URL page = new URL("http://example.com/");
            // When appending query parameters that may contain symbols or spaces,
            // encode each value first, e.g. URLEncoder.encode(someParameter, "UTF-8"),
            // otherwise the request may fail (for example with a 404).
            conn = (HttpURLConnection) page.openConnection();
            conn.connect();
            /* Use this if you need to check the response status:
            int responseCode = conn.getResponseCode();

            if (responseCode == 401 || responseCode == 403) {
                // Authorization Error
                Log.e(tag, "Authorization Error");
                throw new Exception("Authorization Error");
            }

            if (responseCode >= 500 && responseCode <= 504) {
                // Server Error
                Log.e(tag, "Internal Server Error");
                throw new Exception("Internal Server Error");
            }*/
            buff = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            // Test for end of stream before using the line, so null is never parsed.
            while ((line = buff.readLine()) != null) {
                // Return as soon as a line matches. Remove the next three lines
                // if you need to load the whole HTML document instead.
                String found = interpretHtml(line);
                if (found != null)
                    return found;
                text.append(line).append("\n");
            }
        } catch (Exception e) {
            Log.e(Standards.tag, "Exception while getting HTML from website", e);
        } finally {
            if (buff != null) {
                try {
                    buff.close(); // also closes the wrapped InputStreamReader
                } catch (IOException ignored) {
                }
            }
            if (conn != null) {
                conn.disconnect();
            }
        }
        if (text.length() > 0) {
            // Reached only when no single line matched (or the early return
            // above was removed): interpret the accumulated text as a whole.
            return interpretHtml(text.toString());
        }
        return null;
    }
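
The URLEncoder comment in the code above can be made concrete with a small sketch. It assumes a hypothetical query parameter q and a hypothetical /search path on example.com; neither comes from the original answer. Note that URLEncoder uses form encoding, so a space becomes + rather than %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryUrlBuilder {
    /** Appends one encoded name=value pair to a base URL that has no query string yet. */
    static String buildUrl(String baseUrl, String name, String value)
            throws UnsupportedEncodingException {
        // Encode the name and the value separately; never encode the whole URL.
        return baseUrl + "?" + URLEncoder.encode(name, "UTF-8")
                + "=" + URLEncoder.encode(value, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "q" and the /search path are placeholders for illustration only.
        System.out.println(buildUrl("http://example.com/search", "q", "two words & more"));
    }
}
```

The resulting string can then be passed to new URL(...) in place of the fixed address.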

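private String interpretHtml(String s){
    // Guard against null input and lines too short to hold both tags.
    if(s != null && s.length() >= 32 && s.startsWith("<textTag class=\"text\"")){
        // 22 = length of the opening tag, 10 = length of "</textTag>"
        return s.substring(22, s.length() - 10);
    }
    return null;
}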
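
The read loop can also be tried without a network connection. The following is a minimal sketch, not the answer's own code: it takes any Reader, so a StringReader can stand in for the HTTP stream, and it uses try-with-resources to close the reader (this assumes Java 7 language support, which older Android versions lack). LineScanner, match and findFirst are hypothetical names, and match is a stand-in for interpretHtml.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class LineScanner {
    private static final String OPEN = "<textTag class=\"text\">";
    private static final String CLOSE = "</textTag>";

    /** Stand-in for interpretHtml: returns the tag text, or null if the line does not match. */
    static String match(String line) {
        String t = line.trim();
        if (t.length() >= OPEN.length() + CLOSE.length()
                && t.startsWith(OPEN) && t.endsWith(CLOSE)) {
            return t.substring(OPEN.length(), t.length() - CLOSE.length());
        }
        return null;
    }

    /** Scans the source line by line and returns the first match, or null if none is found. */
    static String findFirst(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // readLine() returns null at end of stream, which ends the loop.
            while ((line = reader.readLine()) != null) {
                String found = match(line);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html>\n<body>\n"
                + "<textTag class=\"text\">I want this text</textTag>\n"
                + "</body>\n</html>";
        System.out.println(findFirst(new StringReader(html)));
    }
}
```

In the app, the same method could be given new InputStreamReader(conn.getInputStream()) instead of the StringReader.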
