Get HTML of URL and parse it so the website can be viewed offline


Problem description

Hello all,

Today I had some business with HTML parsing. The requested result was: using the java.net.URL class, get all the HTML content from http://www.google.com/ and set up a file that can be used to view the website offline. The greatest problem turned out to be "fetching" the HTML elements' attributes, like src from an <img></img> tag, href from an <a></a> tag, etc. So far I have got to the src attribute by using regular expressions and the BufferedReader/BufferedWriter classes. A code sample:

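import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaNetworking {
    public static void main(String[] args) {
        // The greedy leading ".*" means find() reports only the last
        // <img ... src="..."> on each line.
        Pattern p = Pattern.compile(".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);
        // try-with-resources closes both streams; closing the writer also
        // flushes it, so the saved file is complete.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new URL("http://www.google.com/").openStream()));
             BufferedWriter wr = new BufferedWriter(new FileWriter("D:/HTMLFile.txt"))) {
            String s;
            while ((s = in.readLine()) != null) {
                wr.write(s);
                wr.newLine(); // readLine() strips the line terminator
                Matcher m = p.matcher(s);
                while (m.find()) {
                    System.out.println(m.group(1)); // value of the src attribute
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(JavaNetworking.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}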



For this particular URL the output is: "/textinputassistant/tia.png"

What I wanted to ask is: can someone give a better example of how to do this? I read on various forums that regex + Java is a hideous monster, so to speak. I have an algorithm in mind that could lighten things up for an experienced programmer, unlike me :)...here it is.
- read all HTML from the URL
- copy it to a string variable
- search the string for "<img"
- when "<img" is found, copy the tag to a new string variable
- search it for the "src" or "href" attribute
- extract the attribute's value (System.out.println("..") will do just fine for now)
I see this as an idiot-proof problem, since I think it could work out just fine like this, but I still think it's better to ask for an opinion from a community made of waaay bigger professionals :)
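
The steps listed above could be sketched roughly as follows, using plain string searches instead of a regex. This is only an illustration under stated assumptions: the class name TagAttributeScanner, its helper methods, and the sample HTML string are made up for this sketch, it handles only double-quoted attribute values, and it would be confused by a ">" inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
class TagAttributeScanner {

    // Case-insensitive indexOf, so that <IMG SRC=...> is found as well.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the double-quoted values of attribute attr inside every <tag ...>.
    static List<String> extract(String html, String tag, String attr) {
        List<String> values = new ArrayList<>();
        String open = "<" + tag;
        String key = attr + "=\"";
        int pos = 0;
        while ((pos = indexOfIgnoreCase(html, open, pos)) >= 0) {
            int after = pos + open.length();
            if (after < html.length() && Character.isLetterOrDigit(html.charAt(after))) {
                pos = after; // a longer tag name such as <abbr>, not <a>
                continue;
            }
            int end = html.indexOf('>', after);
            if (end < 0) {
                break; // unterminated tag, stop scanning
            }
            // Copy the tag into its own string, then search it for the attribute.
            String element = html.substring(pos, end);
            int k = indexOfIgnoreCase(element, key, 0);
            // Skip look-alikes such as data-src: the name must follow whitespace.
            while (k > 0 && !Character.isWhitespace(element.charAt(k - 1))) {
                k = indexOfIgnoreCase(element, key, k + 1);
            }
            if (k > 0) {
                int start = k + key.length();
                int close = element.indexOf('"', start);
                if (close >= 0) {
                    values.add(element.substring(start, close));
                }
            }
            pos = end + 1;
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample input; in the question this would be the downloaded page.
        String html = "<p><IMG alt=\"logo\" SRC=\"/textinputassistant/tia.png\">"
                + "<a href=\"http://www.google.com/\">Google</a></p>";
        System.out.println(extract(html, "img", "src"));
        System.out.println(extract(html, "a", "href"));
    }
}
```

Scanning the whole page as one string, rather than line by line, means a tag that is split across several lines is still found.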

Recommended answer

Please read the RegEx Tutorial at vogella.com.

And please do some research here; we have had such questions in the past.
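
As a rough illustration of the regex route, here is a minimal sketch of a single pattern that collects both src and href values. The class name LinkRegexDemo, the method findLinks, and the sample HTML are assumptions made up for this example; the pattern handles only quoted attribute values and is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the tutorial mentioned above.
class LinkRegexDemo {

    // An <img ...> or <a ...> tag, then a src/href attribute whose value is
    // wrapped in single or double quotes. Group 1 is the quote character,
    // group 2 is the attribute value.
    private static final Pattern LINK = Pattern.compile(
            "<(?:img|a)\\b[^>]*?\\s(?:src|href)\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page.
        String html = "<IMG alt='logo' SRC='/textinputassistant/tia.png'>"
                + "<a class=\"c\" href=\"http://www.google.com/\">Google</a>";
        for (String link : findLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Because the pattern has no leading ".*", repeated calls to find() return every matching tag in the input, not only the last one.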

