Get HTML of URL and parse it so the website can be viewed offline
Problem description
Hello all,
Today I had some business with HTML parsing. The requested result was: using the java.net.URL class, get all HTML content from http://www.google.com/ and set up a file which can be used to view the website offline. The greatest problem turned out to be "fetching" HTML element attributes, like src from an <img></img> tag, href from an <a></a> tag, etc. So far I have got to the src attribute by using regular expressions and the BufferedReader/BufferedWriter classes. A code sample:
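import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JavaNetworking {
    public static void main(String[] args) {
        // Captures the src value of an <img> tag. The greedy ".*" prefix means
        // that find() reports only the last <img> on each line.
        Pattern p = Pattern.compile(".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);
        // try-with-resources closes the reader and flushes/closes the writer.
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new URL("http://www.google.com/").openStream()));
             BufferedWriter wr = new BufferedWriter(new FileWriter("D:/HTMLFile.txt"))) {
            String s;
            while ((s = in.readLine()) != null) {
                wr.write(s);
                wr.newLine(); // readLine() drops the line terminator
                Matcher m = p.matcher(s);
                while (m.find()) {
                    System.out.println(m.group(1));
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(JavaNetworking.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}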
For this particular URL the output is: "/textinputassistant/tia.png"
What I wanted to ask is: can someone give a better example of how to do this? I have read on various forums that regex + Java is a hideous monster, so to speak. I have an algorithm in mind that could lighten things up for an experienced programmer, unlike me :)... here it is.
- read all HTML from the URL
- copy it to a string variable
- search the string for "<img"
- when "<img" is found, copy the tag to a new string variable
- search for the "src" or "href" attribute
- extract the attribute's value (System.out.println("..") will do just fine for now)
I suppose this is an idiot-proof problem, since I think it could work out just fine like this, but I still think it's better to ask for an opinion from a community made of way bigger professionals :)
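The steps listed above could be sketched roughly as follows, using plain string scanning instead of a regular expression. This is only an illustration under some assumptions: the class name TagAttributeScan and both method names are made up for this example, the HTML is passed in as a string rather than downloaded, and only double-quoted attribute values written without spaces around "=" are recognised. A ">" inside an attribute value, or an attribute such as data-src, would confuse it.

```java
import java.util.ArrayList;
import java.util.List;

public class TagAttributeScan {

    // Case-insensitive indexOf, since HTML may be written as <IMG SRC=...>.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Collects the double-quoted values of `attribute` from every `<tag ...>` in `html`.
    static List<String> attributeValues(String html, String tag, String attribute) {
        List<String> values = new ArrayList<>();
        String open = "<" + tag;
        String key = attribute + "=\"";
        int pos = 0;
        while ((pos = indexOfIgnoreCase(html, open, pos)) >= 0) {
            int end = html.indexOf('>', pos);
            if (end < 0) {
                break; // unterminated tag, nothing more to scan
            }
            // Copy the tag text into its own string.
            String tagText = html.substring(pos, end);
            // Require whitespace after the tag name so that "<a" does not match "<abbr".
            boolean exactTag = tagText.length() > open.length()
                    && Character.isWhitespace(tagText.charAt(open.length()));
            int attr = exactTag ? indexOfIgnoreCase(tagText, key, open.length()) : -1;
            if (attr >= 0) {
                int start = attr + key.length();
                int close = tagText.indexOf('"', start);
                if (close >= 0) {
                    values.add(tagText.substring(start, close));
                }
            }
            pos = end + 1;
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the real Google page.
        String html = "<p><IMG alt=\"logo\" src=\"/images/logo.png\">"
                + "<a href=\"/about\">About</a><abbr title=\"x\">y</abbr></p>";
        System.out.println(attributeValues(html, "img", "src")); // [/images/logo.png]
        System.out.println(attributeValues(html, "a", "href"));  // [/about]
    }
}
```

In a real program the html string would be filled from the BufferedReader loop before calling attributeValues.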
Recommended answer
Please read here: RegEx Tutorial @ vogella.com
And please do some research here; we have had such questions in the past.
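As a starting point for the regex route, here is a minimal sketch. It assumes the whole page is already held in one string and that attribute values are double-quoted; the class name LinkExtractor, the method extractLinks, and the sample markup are invented for this illustration. Compared with the pattern in the question, it drops the leading ".*" so that every tag is found, handles both src and href, and resolves relative paths such as /textinputassistant/tia.png against the page URL with java.net.URI, which is what a downloader for offline viewing would need. Regular expressions remain fragile on real-world HTML, so treat this as a sketch rather than a robust parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Group 1 holds the double-quoted src or href value of an <img> or <a> tag.
    // "\\s" before the attribute name keeps attributes such as data-src from matching.
    private static final Pattern LINK = Pattern.compile(
            "<(?:img|a)\\b[^>]*?\\s(?:src|href)\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns every matched link as an absolute URL, resolved against baseUrl.
    static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1)).toString());
            } catch (IllegalArgumentException ex) {
                // Value is not a valid URI (for example, it contains spaces); skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; only the first path comes from the question.
        String html = "<div><IMG class=\"logo\" src=\"/textinputassistant/tia.png\">\n"
                + "<a href=\"intl/en/about.html\">About</a> "
                + "<a href=\"http://example.com/x\">X</a></div>";
        for (String link : extractLinks(html, "http://www.google.com/")) {
            System.out.println(link);
        }
        // http://www.google.com/textinputassistant/tia.png
        // http://www.google.com/intl/en/about.html
        // http://example.com/x
    }
}
```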