Crawling a website that needs authentication
Question
How would I write a simple script (in cURL/python/ruby/bash/perl/java) that logs in to okcupid and tallies how many messages I've received each day?
The output would look something like:
1/21/2011 1 messages
1/22/2011 0 messages
1/23/2011 2 messages
1/24/2011 1 messages
The main issue is that I have never written a web crawler before. I have no idea how to programmatically log in to a site like okcupid, or how to make the authentication persist while loading different pages.
Once I have access to the raw HTML, I'll be fine with regexes, maps, etc.
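For the tallying part, a minimal Python sketch of the "regexes and maps" step might look like the following. The date pattern and the sample HTML are assumptions, since the real inbox markup isn't shown here; the pattern would need to be adjusted to whatever the actual pages contain.

```python
import re
from collections import Counter

def tally_messages(html):
    """Count messages per date in raw inbox HTML.

    Assumes each message row contains a date formatted like 1/21/2011;
    adjust the regex to the site's real markup.
    """
    dates = re.findall(r'\b(\d{1,2}/\d{1,2}/\d{4})\b', html)
    return Counter(dates)

# Made-up sample of inbox HTML for illustration:
sample = ('<li class="message">1/21/2011</li>'
          '<li class="message">1/23/2011</li>'
          '<li class="message">1/23/2011</li>')

for date, n in sorted(tally_messages(sample).items()):
    print(f"{date} {n} messages")
# → 1/21/2011 1 messages
#   1/23/2011 2 messages
```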
Answer
Here's a solution using cURL that downloads the first page of the inbox. A proper solution would repeat the last step for each page of messages. $USERNAME and $PASSWORD need to be filled in with your info.
#!/bin/sh

## Initialize the cookie jar
curl --cookie-jar cjar --output /dev/null https://www.okcupid.com/login

## Log in and save the resulting HTML as loginResult.html (for debugging).
## Note: the credentials must be in double quotes so the shell expands
## them; --data-urlencode also handles special characters in the values.
curl --cookie cjar --cookie-jar cjar \
     --data 'dest=/?' \
     --data-urlencode "username=$USERNAME" \
     --data-urlencode "password=$PASSWORD" \
     --location \
     --output loginResult.html \
     https://www.okcupid.com/login

## Download the inbox and save it as inbox.html
curl --cookie cjar \
     --output inbox.html \
     https://www.okcupid.com/messages
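The same three steps can be sketched in Python using only the standard library, with a cookie jar object playing the role of cURL's `--cookie-jar` file so the authentication persists across requests. The URLs and form-field names (`dest`, `username`, `password`) are taken from the cURL commands above; the real site may have changed since this answer was written.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def fetch_inbox(username, password):
    """Log in and return the inbox HTML, mirroring the cURL steps above."""
    # Every request made through this opener sends and stores cookies,
    # so the login session carries over to later page loads.
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )

    # Step 1: initialize the cookie jar.
    opener.open("https://www.okcupid.com/login").read()

    # Step 2: POST the login form; the session cookie lands in the jar.
    form = urllib.parse.urlencode(
        {"dest": "/?", "username": username, "password": password}
    ).encode()
    opener.open("https://www.okcupid.com/login", data=form).read()

    # Step 3: fetch the inbox with the authenticated cookies.
    return opener.open("https://www.okcupid.com/messages").read().decode()

# Usage (performs live network requests):
# html = fetch_inbox("your_username", "your_password")
```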
This technique is explained in a video tutorial about cURL.