How to crawl a website after logging in with a username and password

Question

I have written a web crawler that crawls a website by keyword, but I want to log in to a specific website and filter the information by keyword. How can I achieve that? Here is the code I have so far.

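import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {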

public Connection conn = null;

public DB() {
    try {
        Class.forName("com.mysql.jdbc.Driver");
        String url = "jdbc:mysql://localhost:3306/test";
        conn = DriverManager.getConnection(url, "root","root");
        System.out.println("conn built");
    } catch (SQLException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
}

public ResultSet runSql(String sql) throws SQLException {
    Statement sta = conn.createStatement();
    return sta.executeQuery(sql);
}

public boolean runSql2(String sql) throws SQLException {
    Statement sta = conn.createStatement();
    return sta.execute(sql);
}

@Override
protected void finalize() throws Throwable {
    if (conn != null && !conn.isClosed()) {
        conn.close();
    }
}
}


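import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {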
public static DB db = new DB();

public static void main(String[] args) throws SQLException, IOException {
    db.runSql2("TRUNCATE Record;");
    processPage("http://m.naukri.com/login");
}

public static void processPage(String URL) throws SQLException, IOException{
    //check if the given URL is already in database;
    String sql = "select * from Record where URL = ?";
    PreparedStatement check = db.conn.prepareStatement(sql);
    check.setString(1, URL);
    ResultSet rs = check.executeQuery();
    if(rs.next()){

    }else{
        //store the URL to database to avoid parsing again
        sql = "INSERT INTO  `test`.`Record` " + "(`URL`) VALUES " + "(?);";
        PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
        stmt.setString(1, URL);
        stmt.execute();

        //get useful information
        //log in: POST the credentials and keep the cookies the server sends back
        Connection.Response res = Jsoup.connect("http://www.naukri.com/")
                .data("username", "jeet.chatterjee.88@gmail.com", "password", "Letmein321")
                .method(Method.POST)
                .execute();
        //http://m.naukri.com/login
        //fetch the page with the login cookies attached
        Map<String, String> loginCookies = res.cookies();
        Document doc = Jsoup.connect("http://m.naukri.com/login")
                .cookies(loginCookies)
                .get();

        //NOTE: contains("") is true for every page; put the keyword to filter on here
        if(doc.text().contains("")){
            System.out.println(URL);
        }

        //get all links and recursively call the processPage method
        Elements questions = doc.select("a[href]");
        for(Element link: questions){
            if(link.attr("abs:href").contains("naukri.com"))
                processPage(link.attr("abs:href"));
        }
    }
}
}

The table structure as well:

 CREATE TABLE IF NOT EXISTS `Record` (
 `RecordID` INT(11) NOT NULL AUTO_INCREMENT,
 `URL` text NOT NULL,
  PRIMARY KEY (`RecordID`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

Now I want to use my username and password for the crawl, so that the crawler can log in to the site dynamically and crawl information based on a keyword. Let's say my username is lucifer and my password is lucifer123.

Recommended answer

Your approach is for stateless web access. That usually works for web services, but websites are generally stateful: you authenticate once, and after that the site uses the session key stored in your cookie to identify you. So the cookie is required, and you must also send the parameters that your browser sends. Try monitoring what your browser sends to the site with Firebug, and reproduce that in your code.
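
To make the "stateful" part concrete, here is a minimal sketch that uses only the JDK's java.net.CookieManager rather than jsoup. The site example.com, the two paths, and the cookie name SESSIONID are made up for illustration, and nothing goes over the network: the login response is simulated.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

final class SessionCookieSketch {

    /** Returns the Cookie header a later request would carry after one login response. */
    static String cookieSentAfterLogin() {
        try {
            CookieManager jar = new CookieManager();

            // Pretend the login response of a hypothetical site set a session cookie.
            URI loginPage = URI.create("http://example.com/login");
            jar.put(loginPage, Map.of("Set-Cookie", List.of("SESSIONID=abc123; Path=/")));

            // A later request to the same site gets the cookie back: that is the "state".
            URI otherPage = URI.create("http://example.com/jobs");
            Map<String, List<String>> headers = jar.get(otherPage, Map.of());
            return String.join("; ", headers.getOrDefault("Cookie", List.of()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cookieSentAfterLogin());
    }
}
```

A request that does not carry this cookie looks like a brand-new visitor to the server, which is why a crawler that drops the cookies is sent back to the login page.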

- Update -

Jsoup.connect("url")
  .cookie("cookie-name", "cookie-value")
  .header("header-name", "header-value")
  .data("data-name","data-value");

You can add multiple cookies, headers, and data values, and there are functions for adding values from a Map.
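
As an illustration of what a map of form fields turns into when it is sent as POST data, here is a small JDK-only sketch. jsoup does this encoding for you, so this is only to show what ends up on the wire. The helper name encodeForm and the field names username and password are assumptions; the real site may expect different field names, which is exactly what the browser's network view will tell you.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FormBodySketch {

    /** Encodes name/value pairs as an application/x-www-form-urlencoded body. */
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "lucifer");      // placeholder credentials from the question
        fields.put("password", "lucifer123");
        System.out.println(encodeForm(fields)); // prints username=lucifer&password=lucifer123
    }
}
```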

To find out what must be set, add Firebug to your browser; browsers also all have a built-in developer console, which can be opened with F12. Go to the URL you want to get data from, and add everything you see there to your jsoup request. I added some screenshots of the results from your site.

I marked the important parts in red.

You can get the required cookies in your code by sending this information to the site and reading the cookies from the response. After getting response.cookies, attach those cookies to every request you make ;)
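
The login-then-reuse flow described here can be sketched end to end with only the JDK (java.net.http.HttpClient, Java 11 or later) instead of jsoup. Everything on the server side is a stand-in so that the example runs offline: a throwaway local server, hypothetical /login and /jobs paths, assumed form field names, a made-up SESSIONID cookie, and the placeholder credentials from the question. A real site will differ in all of these.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpCookie;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class LoginThenFetchSketch {
    // Placeholder credentials from the question; the field names are assumed.
    static final String FORM = "username=lucifer&password=lucifer123";

    /** Logs in to a local stand-in site, then fetches a protected page with the login cookies. */
    static String fetchAfterLogin() {
        HttpServer server = null;
        try {
            server = startFakeSite();
            String base = "http://127.0.0.1:" + server.getAddress().getPort();
            HttpClient client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();

            // Step 1: send the login form once.
            HttpRequest login = HttpRequest.newBuilder(URI.create(base + "/login"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(FORM))
                    .build();
            HttpResponse<String> loginResponse = client.send(login, HttpResponse.BodyHandlers.ofString());

            // Step 2: collect the cookies from the login response.
            String cookieHeader = loginResponse.headers().allValues("set-cookie").stream()
                    .flatMap(value -> HttpCookie.parse(value).stream())
                    .map(cookie -> cookie.getName() + "=" + cookie.getValue())
                    .collect(Collectors.joining("; "));

            // Step 3: attach them to every later request.
            HttpRequest page = HttpRequest.newBuilder(URI.create(base + "/jobs"))
                    .header("Cookie", cookieHeader)
                    .GET()
                    .build();
            return client.send(page, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    /** Stand-in site: /login sets a session cookie, /jobs only answers when it is sent back. */
    private static HttpServer startFakeSite() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/login", exchange -> {
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            boolean ok = FORM.equals(body);
            if (ok) {
                exchange.getResponseHeaders().add("Set-Cookie", "SESSIONID=abc123; Path=/");
            }
            respond(exchange, ok ? 200 : 401, ok ? "welcome" : "bad credentials");
        });
        server.createContext("/jobs", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            boolean loggedIn = cookie != null && cookie.contains("SESSIONID=abc123");
            respond(exchange, loggedIn ? 200 : 403, loggedIn ? "java developer jobs" : "please log in");
        });
        server.start();
        return server;
    }

    private static void respond(HttpExchange exchange, int status, String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, bytes.length);
        exchange.getResponseBody().write(bytes);
        exchange.close();
    }

    public static void main(String[] args) {
        System.out.println(fetchAfterLogin());
    }
}
```

In a recursive crawler the login step belongs outside the per-page method: log in once, keep the cookies, and pass them along for each page, rather than posting the credentials again for every link.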

P.S.: change your password as soon as possible.
