Get source of website in Java

Question

I would like to use Java to get the source of a (secure) website and then parse it for the links it contains. I have found how to connect to the URL, but how can I easily get just the source, preferably as a DOM Document, so that I can easily extract the info I want?

Or is there a better way to connect to an HTTPS site and get the source (which I need in order to get a table of data... it's pretty simple)? Those links are the files I am going to download.

I wish it were FTP, but these are files stored on my TiVo (I want to programmatically download them to my computer).

Solution

You can go low level and just request the page over a socket. In Java it looks like this:

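import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class GetSource {
    // args[0] = hostname
    // args[1] = file, like index.html
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);

        // getSession() performs the TLS handshake if it has not happened yet.
        SSLSession session = sslsock.getSession();
        X509Certificate cert;
        try {
            cert = (X509Certificate) session.getPeerCertificates()[0];
        } catch (SSLPeerUnverifiedException e) {
            System.err.println(session.getPeerHost() + " did not present a valid cert.");
            sslsock.close();
            return;
        }

        // Now use the secure socket just like a regular socket to read pages.
        PrintWriter out = new PrintWriter(sslsock.getOutputStream());
        // The request target must start with "/", so add one if the argument lacks it.
        String path = args[1].startsWith("/") ? args[1] : "/" + args[1];
        out.write("GET " + path + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
        out.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
        String line;
        // Adapted from Oscar's regex: capture only up to the closing quote of href.
        String regExp = "<a href=\"([^\"]*)\"";
        Pattern p = Pattern.compile(regExp, Pattern.CASE_INSENSITIVE);

        while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            // find() in a loop reports every link on the line, not just one.
            while (m.find()) {
                System.out.println(m.group(1));
            }
        }

        sslsock.close();
    }
}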
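
For comparison, here is a rough higher-level sketch of the same fetch-and-extract idea. It is not part of the answer above. It assumes the page is served over HTTPS, is UTF-8 encoded, and needs no login. The class and method names (`LinkExtractor`, `fetch`, `extractLinks`) are made up for illustration. It lets `HttpsURLConnection` build the request and handle TLS, and keeps the link extraction in its own method so it can be tried on a plain string. The regex is still only a heuristic, not a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.HttpsURLConnection;

// Hypothetical helper class; none of these names come from the original answer.
public class LinkExtractor {
    // Captures the double-quoted href value of each <a ...> tag (heuristic only).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Downloads the page body as text. Assumption: the server sends UTF-8.
    public static String fetch(String httpsUrl) throws IOException {
        HttpsURLConnection conn =
                (HttpsURLConnection) URI.create(httpsUrl).toURL().openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    // Returns every href value found in the given HTML, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // args[0] = full URL, e.g. https://host/index.html
    public static void main(String[] args) throws IOException {
        for (String link : extractLinks(fetch(args[0]))) {
            System.out.println(link);
        }
    }
}
```

Because `extractLinks` takes a plain string, it can be checked against a saved copy of the page before pointing `fetch` at the real host.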
