Using Jsoup to sign in and crawl data
Question
I want to use Jsoup to crawl a page that is only available when I am signed in. I guess this means I need to sign in on one page and send the cookies to another page.
I read some earlier posts here and wrote the following code:
public static void main(String[] args) throws IOException {
    // Submit the login form; Jsoup.connect needs a URL with an explicit protocol
    Connection.Response res = Jsoup.connect("https://login.yahoo.com")
            .data("login", "myusername", "passwd", "mypassword")
            .method(Method.POST)
            .execute();
    Document doc = res.parse();
    String sessionId = res.cookie("SESSIONID");
    // Request the protected page, sending the session cookie along
    Document doc2 = Jsoup.connect("http://health.groups.yahoo.com/group/asthma/messages")
            .cookie("SESSIONID", sessionId)
            .get();
    Elements eles = doc2.getElementsByClass("message");
    String content = eles.first().text();
    System.out.println(content);
}
My question is: how can I know which cookie name (i.e. "SESSIONID" here) to use for sending my login info? I used the .cookies() method to get all the cookies from the login page:
B
DK
YM
T
PH
Y
F
I tried them one by one, but none worked. I could get a sessionId from some of them, but I could not retrieve any nodes from the second page, which means I didn't sign in successfully. Could anyone give me some suggestions? Many thanks!
Recommended answer
I've also struggled with logging in to websites with jsoup.
What I came up with was a hybrid of Selenium WebDriver and jsoup.
WebDriver can remote-control a browser; it is typically used for testing purposes.
For my application, it was not desirable to have the browser visible and messing about on the screen, so I used the "silent" WebDriver, HtmlUnitDriver, instead. You can instantiate it with this line of code:
HtmlUnitDriver driver = new HtmlUnitDriver(true); // true enables JavaScript support (using Rhino, I believe)
Now log in to the website:
String baseUrl = "http://www.thesite.com";
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
driver.get(baseUrl);
driver.findElement(By.id("TextBoxUser")).clear();
driver.findElement(By.id("TextBoxUser")).sendKeys("username");
driver.findElement(By.id("TextBoxPass")).clear();
driver.findElement(By.id("TextBoxPass")).sendKeys("password");
driver.findElement(By.id("Button1")).click();
Get the page content:
String htmlContent = driver.getPageSource();
Start using jsoup:
Document document = Jsoup.parse(htmlContent);
This works well for me.
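The question of which cookie name to forward can often be sidestepped by forwarding every cookie the logged-in session holds rather than guessing a single one. The following is a minimal sketch in plain JDK Java of what forwarding all cookies amounts to at the HTTP level. The class name `CookieForwarding`, the method `toCookieHeader`, and the cookie values are hypothetical placeholders, and whether forwarding all cookies is sufficient for a particular site is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieForwarding {

    /** Joins name/value pairs into the value of a single Cookie request header. */
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Placeholder values (assumption); real ones would come from the
        // login response or from the browser session after signing in.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("B", "b-value");
        cookies.put("T", "t-value");
        cookies.put("Y", "y-value");

        // All cookies travel together in one header on the follow-up request.
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```

In the hybrid setup above, the name/value pairs could be collected from `driver.manage().getCookies()` (each Selenium `Cookie` exposes `getName()` and `getValue()`) and handed to jsoup via `Jsoup.connect(url).cookies(map)`, which sends them all with the request, so no header needs to be built by hand.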
Steffn Otto Jensen