Get HTML of URL and parse it so the website can be viewed offline

Views: 149 | Published: 2019/6/17 21:10:53 | Tags: Java, HTML, Eclipse, Parsing, netbeans

Question

Hello all,

Today I had some business with HTML parsing. The requested result was: using the java.net.URL class, get all the HTML content from http://www.google.com/ and set up a file which can be used to view the website offline. The greatest problem turned out to be "fetching" HTML element attributes, such as src from an <img></img> tag, href from an <a> tag, etc. So far I have got to the src attribute by using regular expressions and the BufferedReader/BufferedWriter classes. A code sample:

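import java.io.*;
import java.net.URL;
import java.util.logging.*;
import java.util.regex.*;

public class JavaNetworking {
    public static void main(String[] args) {
        // The leading ".*" is greedy, so find() reports only the last <img> on each line.
        Pattern p = Pattern.compile(".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);
        // try-with-resources closes both streams, which also flushes the writer.
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new URL("http://www.google.com/").openStream()));
             BufferedWriter wr = new BufferedWriter(new FileWriter("D:/HTMLFile.txt"))) {
            String s;
            while ((s = in.readLine()) != null) {
                wr.write(s);
                wr.newLine(); // readLine() strips the line terminator
                Matcher m = p.matcher(s);
                while (m.find()) {
                    System.out.println(m.group(1)); // the captured src value
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(JavaNetworking.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}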

For this particular URL the output is: "/textinputassistant/tia.png"

What I wanted to ask is: can someone give a better example of how to do this? I read on various forums that regex + Java is a hideous monster, so to speak. I have an algorithm in mind that could lighten things up for an experienced programmer, unlike me :)... here it is.
- read all html from the URL
- copy to a string variable
- search in string for "<img"
- when "<img" is found, copy to a new string variable
- search for "src" or "href" attribute
- extract the attributes value (System.out.println("..") will do just fine for now)
I see this as an idiot-proof problem, since I think it could work out just fine like this, but still I think it's better to ask for an opinion from a community made of waaay bigger professionals :)
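
The steps listed above could be sketched roughly like this, using plain string searches instead of a regex. It is only a minimal illustration: the class AttributeScanner and its two methods are made-up names, the HTML is assumed to be already loaded into a String, and attribute values are assumed to be double-quoted. It is also naive on purpose, so searching for "src" would match "data-src" as well, and "<img" would match any tag name that merely starts with "img".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original code sample.
final class AttributeScanner {

    // Case-insensitive indexOf built on String.regionMatches; returns -1 if not found.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Collects the double-quoted value of `attribute` from every "<tagName ...>" in html.
    static List<String> extract(String html, String tagName, String attribute) {
        List<String> values = new ArrayList<>();
        String open = "<" + tagName;       // e.g. "<img"
        String key = attribute + "=\"";    // e.g. src="
        int pos = 0;
        while ((pos = indexOfIgnoreCase(html, open, pos)) != -1) {
            int end = html.indexOf('>', pos);          // end of this tag
            if (end == -1) {
                break;                                 // unterminated tag: stop scanning
            }
            int keyAt = indexOfIgnoreCase(html, key, pos);
            if (keyAt != -1 && keyAt < end) {          // attribute belongs to this tag
                int start = keyAt + key.length();
                int close = html.indexOf('"', start);  // closing quote of the value
                if (close != -1 && close < end) {
                    values.add(html.substring(start, close));
                }
            }
            pos = end + 1;                             // continue after this tag
        }
        return values;
    }
}
```

Calling AttributeScanner.extract(html, "img", "src") collects the image sources, and AttributeScanner.extract(html, "a", "href") does the same for links. For real-world pages a proper HTML parser is likely to be more robust than either this or the regex.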