Using Jsoup to sign in and crawl data


Question

I want to use Jsoup to crawl a page that is only available when I am signed in. I guess this means I need to sign in on one page and send the cookies to another page.
I read some earlier posts here and wrote the following code:

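import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public static void main(String[] args) throws IOException {
    // Submit the login form; Jsoup.connect needs a full URL including the protocol
    Connection.Response res = Jsoup.connect("https://login.yahoo.com")
        .data("login", "myusername", "passwd", "mypassword")
        .method(Method.POST)
        .execute();

    Document doc = res.parse();
    String sessionId = res.cookie("SESSIONID");

    // Request the protected page, sending the session cookie along
    Document doc2 = Jsoup.connect("http://health.groups.yahoo.com/group/asthma/messages")
        .cookie("SESSIONID", sessionId)
        .get();

    Elements eles = doc2.getElementsByClass("message");
    String content = eles.first().text();
    System.out.println(content);
}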

My question is: how can I find out the name of the cookie (i.e. "SESSIONID") that carries my login info? I used the .cookies() method to get all the cookies from the login page:


B
DK
YM
T
PH
Y
F

I tried them one by one, but none worked. I could get a sessionId from some of them, but I could not retrieve the nodes from the second page, which means I did not sign in successfully. Could anyone give me some suggestions? Many thanks!
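
A note on the cookie names: a site can spread its session state over several cookies, so forwarding the whole set is often more reliable than picking a single name. In Jsoup the map returned by `Response.cookies()` can be passed as a whole to `Connection.cookies(Map)`. The sketch below uses only the JDK to illustrate the idea: it collects name/value pairs from `Set-Cookie` header values and builds one `Cookie` request header from all of them. The class name `CookieForwarding`, its helper methods, and the header values are made up for illustration, and this is not necessarily how Yahoo's login actually behaves.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper class, for illustration only
class CookieForwarding {

    /** Collects name -> value pairs from raw Set-Cookie header values. */
    static Map<String, String> collect(List<String> setCookieHeaders) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                cookies.put(cookie.getName(), cookie.getValue());
            }
        }
        return cookies;
    }

    /** Builds a Cookie request header that sends every collected cookie back. */
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joined = new StringJoiner("; ");
        cookies.forEach((name, value) -> joined.add(name + "=" + value));
        return joined.toString();
    }

    public static void main(String[] args) {
        // Made-up header values; real ones come from the login response
        List<String> headers = Arrays.asList(
            "B=abc123; path=/; domain=.yahoo.com",
            "Y=def456; path=/; domain=.yahoo.com",
            "T=ghi789; path=/; domain=.yahoo.com");

        Map<String, String> cookies = collect(headers);
        System.out.println(cookies.keySet());
        System.out.println(toCookieHeader(cookies));
    }
}
```

If the second page still comes back without the expected nodes after all cookies are forwarded, the login request itself was probably rejected, for example because the form expects hidden fields or JavaScript.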

Recommended answer

I've also struggled with logging in to websites with jsoup.

What I came up with was a hybrid of Selenium WebDriver and jsoup.

WebDriver can remote-control a browser; typically this is used for testing purposes.

For my application, it was not desirable to have the browser visible and messing about on the screen, so I used the "silent" WebDriver, HtmlUnitDriver, instead. You can instantiate it using this line of code:

HtmlUnitDriver driver = new HtmlUnitDriver(true); // true enables JavaScript support (using Rhino, I believe)

Now log in to the website:

String baseUrl = "http://www.thesite.com";

driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);

driver.get(baseUrl);

driver.findElement(By.id("TextBoxUser")).clear();
driver.findElement(By.id("TextBoxUser")).sendKeys("username");
driver.findElement(By.id("TextBoxPass")).clear();
driver.findElement(By.id("TextBoxPass")).sendKeys("password");
driver.findElement(By.id("Button1")).click();

Get the page content:

String htmlContent = driver.getPageSource();

Start using jsoup:

Document document = Jsoup.parse(htmlContent);

This works well for me.
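
As a small follow-up sketch, this is one way the parsed page source could be queried once jsoup has it. It assumes the jsoup library is on the classpath. The class name `ParsePageSource`, the helper `firstTextByClass`, and the sample HTML are hypothetical. The "message" class name is borrowed from the question's code. Note that `Elements.first()` returns null when nothing matches, which is what happens when the login did not go through, so the helper checks for that instead of calling `text()` directly.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, for illustration only
class ParsePageSource {

    /** Returns the text of the first element with the given class, or null if none matches. */
    static String firstTextByClass(String html, String className) {
        Document document = Jsoup.parse(html);
        Elements matches = document.getElementsByClass(className);
        Element first = matches.first(); // null when nothing matched
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by driver.getPageSource()
        String htmlContent =
            "<html><body><div class=\"message\">Hello from the group</div></body></html>";

        System.out.println(firstTextByClass(htmlContent, "message"));
    }
}
```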

Steffn Otto Jensen
