Scraping a password-protected forum in R

Question

I have a problem logging in from my script. Despite the many good answers I found on Stack Overflow, none of the solutions worked for me.

I am scraping a web forum for my PhD research. Its URL is http://forum.axishistory.com.

The page I want to scrape is the member list, a page that lists the links to all member profiles. The member list is only accessible when logged in. If you try to access it without logging in, the site shows the login form instead.

The URL of the member list is http://forum.axishistory.com/memberlist.php.

First I tried the httr package:

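library(httr)
library(rvest)  # provides read_html(); html() is the deprecated older name

# authenticate() sends HTTP basic-auth credentials with the request
members <- GET("http://forum.axishistory.com/memberlist.php", authenticate("username", "password"))
members_html <- read_html(members)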

The output is the login form.

Then I tried RCurl:

library(RCurl)
library(XML)  # htmlParse() comes from the XML package, not RCurl

members_html <- htmlParse(getURL("http://forum.axishistory.com/memberlist.php", userpwd = "username:password"), asText = TRUE)
members_html

The output is the login form, again.

Then I tried the list() approach from this topic, Scrape password-protected website in R:

library(httr)

handle <- handle("http://forum.axishistory.com/")
path   <- "ucp.php?mode=login"

# Form fields to send in the POST body
login <- list(
  amember_login        = "username",
  amember_pass         = "password",
  amember_redirect_url = "http://forum.axishistory.com/memberlist.php"
)

response <- POST(handle = handle, path = path, body = login)

And again, the output is the login form.

The next thing I am looking into is RSelenium, but after all these attempts I am wondering whether I am missing something (probably something completely obvious).

I have looked at other relevant posts here, but could not figure out how to apply their code to my case:

How to use R to download a zipped file from an SSL page that requires cookies

Scrape password-protected website in R

https://stackoverflow.com/questions/27485311/scrape-password-protected-https-website-in-r

Web scraping password-protected website using R

Recommended answer

Thanks to Simon, I found the answer here: Using rvest or httr to log in to non-standard forms on a webpage

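library(rvest)  # written for rvest >= 1.0.0

url       <- "http://forum.axishistory.com/memberlist.php"
pgsession <- session(url)  # html_session() in older rvest

# [[2]] selects the second form on the page, here the login form
pgform <- html_form(pgsession)[[2]]

filled_form <- html_form_set(pgform,  # set_values() in older rvest
                             username = "username",
                             password = "password")

# Submit the login, then request the member list in the logged-in session
pgsession  <- session_submit(pgsession, filled_form)  # submit_form() in older rvest
memberlist <- session_jump_to(pgsession, url)         # jump_to() in older rvest

page <- read_html(memberlist)  # html() in older rvest

usernames      <- html_elements(page, css = "#memberlist .username")
data_usernames <- html_text(usernames, trim = TRUE)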
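
The index [[2]] in the answer depends on how many forms the page contains, so it may help to check which form actually holds the password field before filling it in. The following is only a minimal offline sketch of that check: the HTML snippet, its form ids and its action URLs are made-up stand-ins for the real forum page, and the field names simply mirror the ones used in the answer.

```r
library(rvest)

# Hypothetical page: a search form followed by a login form, standing in
# for what a logged-out visitor might receive. Not the forum's real markup.
page <- read_html('
  <html><body>
    <form id="search" action="/search.php" method="get">
      <input type="text" name="keywords">
    </form>
    <form id="login" action="/ucp.php?mode=login" method="post">
      <input type="text" name="username">
      <input type="password" name="password">
      <input type="submit" name="login" value="Login">
    </form>
  </body></html>
')

forms <- html_form(page)

# TRUE for every form that contains at least one password input
has_password <- vapply(
  forms,
  function(form) {
    any(vapply(form$fields,
               function(field) identical(field$type, "password"),
               logical(1)))
  },
  logical(1)
)

# Take the first form with a password field and list its field names
login_index <- which(has_password)[1]
login_form  <- forms[[login_index]]
field_names <- unname(vapply(login_form$fields,
                             function(field) field$name,
                             character(1)))
```

On a live session the same check could be applied to html_form(pgsession), and field_names then shows which names to pass when filling in the form.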
