Pulling webpages from an adult site: how to get past the site agreement?


Question

I'm trying to parse a bunch of webpages from an adult website using Ruby:


require 'hpricot'
require 'open-uri'

# On Ruby 3.0+, open-uri no longer patches Kernel#open, so call URI.open.
doc = Hpricot(URI.open('random page on an adult website'))

However, what I get back instead is the initial 'Site Agreement' page that makes sure you're 18+, etc.

How do I get past the Site Agreement and pull the webpages I want? (If there's a way to do it, any language is fine.)

Recommended answer

You're going to have to figure out how the site detects that a visitor has accepted the agreement.

The most obvious choice would be cookies. Most likely, when a visitor accepts the agreement, the site sends a cookie to their browser, and the browser then passes it back to the site on every subsequent request.

You'll have to make your script act like a visitor by accepting the cookie and sending it with every subsequent request. That means writing code that requests the "accept agreement" page first, finds the cookie, and stores it for later use. It's likely that the site doesn't use a dedicated cookie for the agreement but stores the acceptance in a session instead, in which case you just need to find the session cookie.
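
As a rough illustration of the steps above, here is a minimal sketch using Ruby's standard Net::HTTP. It assumes the agreement is accepted by a plain GET request and that the site tracks acceptance with cookies; neither is confirmed by the question. The helper names `cookie_header` and `fetch_with_agreement`, and the URLs passed to them, are hypothetical. If the site expects a form POST or redirects after acceptance, the first request would need to be adapted.

```ruby
require 'net/http'
require 'uri'

# Turn the Set-Cookie headers of a response into a single Cookie header value.
# Only the leading name=value pair of each header is kept; attributes such as
# Path or Expires are ignored in this sketch.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |c| c.split(';').first.strip }.join('; ')
end

# Hypothetical flow: request the "accept agreement" URL first, then reuse
# whatever cookies it set when fetching the page we actually want.
# Redirects are not followed here.
def fetch_with_agreement(accept_url, page_url)
  accept_res = Net::HTTP.get_response(URI(accept_url))
  cookies = cookie_header(accept_res.get_fields('set-cookie') || [])

  uri = URI(page_url)
  req = Net::HTTP::Get.new(uri)
  req['Cookie'] = cookies unless cookies.empty?
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req).body
  end
end
```

The string returned by `fetch_with_agreement` could then be handed to the parser, for example `Hpricot(html)`. To find the right accept URL and cookie names, it helps to watch the requests your browser makes when you click through the agreement by hand.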
