Web scraping a password-protected website using R


Question

I would like to scrape Yammer data using R, but to do so I first have to log in to this page (which is the authentication page for an app that I created).

https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg

Once I log in to this page I can get the Yammer data, but only in the browser, through the standard Yammer URLs (https://www.yammer.com/api/v1/messages/received.json).

I have read through similar questions and tried the suggestions, but I still can't get past this issue.

I have tried using httr, RSelenium, and rvest + SelectorGadget.

The end goal is to do everything in R: getting the data, cleaning it, and sentiment analysis. The cleaning and sentiment analysis parts are done, but for now the data retrieval is manual, and I would like to automate it from R.

1. Trial using httr:

    library(httr)
    usinghttr <- GET("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg",
                     authenticate("Username", "Password"))

Corresponding result:

    Response [https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg]
      Date: 2015-04-27 12:25
      Status: 200
      Content-Type: text/html; charset=utf-8
      Size: 15.7 kB

The content of this page showed that it had opened the login page but had not authenticated.
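A 200 status only says that a page was served, not that the login succeeded. One way to check this from R is to look for a password field in the returned HTML. The sketch below is an illustration, not part of the original question: it assumes rvest 1.0 or later, and `looks_like_login_page` and the sample HTML strings are hypothetical.

```r
library(rvest)

# Hypothetical helper: TRUE if the HTML still contains a password input,
# which suggests a login form was served rather than authenticated content.
looks_like_login_page <- function(html_text) {
  page <- read_html(html_text)
  length(html_elements(page, "input[type='password']")) > 0
}

# Made-up sample pages, used only to exercise the helper offline.
sample_login <- paste0(
  "<html><body><form action='https://example.com/login' method='post'>",
  "<input type='text' name='login'>",
  "<input type='password' name='password'>",
  "</form></body></html>"
)
sample_other <- "<html><body><p>Messages</p></body></html>"

looks_like_login_page(sample_login)
looks_like_login_page(sample_other)
```

With the httr response from the trial above, the same check could be run as looks_like_login_page(content(usinghttr, as = "text")).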

2. Trial using SelectorGadget + rvest

I tried scraping Wikipedia using this method, but I couldn't apply it to Yammer, because authentication is required before the HTML tag that SelectorGadget gives can be requested.

3. Trial using RSelenium

I tried this with the standard browsers and with PhantomJS, but got some errors:

    > startServer()
    > remDr <- remoteDriver$new()
    > remDr$open()
    [1] "Connecting to remote server"
    Undefined error in RCurl call.
    Error in queryRD(paste0(serverURL, "/session"), "POST", qdata = toJSON(serverOpts)) :

    > pJS <- phantom()
    Error in phantom() : PhantomJS binary not located.

Solution

I also spent a very long time trying to access password-protected sites from inside R. I finally managed it by submitting the credentials as an HTML form. I had a quick look at the Yammer login page, and it seems similar to the case where I got access.

Here is the code that I used; you need to adapt it to your context. You first start a session on the login page, then reach the form that collects the ID and the password, and finally submit the form. I think that in your case the code below would work:

    library(rvest)
    library(magrittr)  # for extract2()

    session <- html_session("https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg")
    login_form <- session %>%
      html_nodes("form") %>%
      extract2(1) %>%  # example only: adapt this step so that it reaches the login form
      html_form() %>%
      set_values(login = YourId, password = YourPasswd)
    logged_in <- session %>% submit_form(login_form)

logged_in should contain the session information after logging in.
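In rvest 1.0 and later, the functions used above were renamed: html_session() became session(), set_values() became html_form_set(), submit_form() became session_submit(), and jump_to() became session_jump_to(). The same three steps could be sketched with the newer names as follows. This is a sketch under assumptions: the wrapper `log_in` and its `form_index` argument are hypothetical, and the field names `login` and `password` are taken from the answer's code and may not match the real page.

```r
library(rvest)

# Hypothetical wrapper around the three steps described in the answer:
# start a session, fill in the login form, submit it.
log_in <- function(login_url, user, passwd, form_index = 1) {
  s <- session(login_url)
  # html_form() returns a list of forms; form_index picks one of them
  # (assumption: the login form is the first form on the page).
  form <- html_form(s)[[form_index]]
  # html_form_set() raises an error if these field names do not exist
  # in the form, so adjust them to the names the real page uses.
  form <- html_form_set(form, login = user, password = passwd)
  session_submit(s, form)
}

# Not run here: needs network access and real credentials.
# logged_in <- log_in(
#   "https://www.yammer.com/dialog/authenticate?client_id=iVGCK1tOhbZGS7zC8dPjg",
#   YourId, YourPasswd
# )
# received <- session_jump_to(
#   logged_in, "https://www.yammer.com/api/v1/messages/received.json"
# )
```

Whether the API URL then returns data depends on the site accepting the session cookies set by the login, which the answer does not confirm.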

