Crawling websites which ask for authentication


Question

I followed https://wiki.apache.org/nutch/HttpAuthenticationSchemes to crawl a few websites by providing a username and password.

Workaround: i) I have set the auth-configuration in the httpclient-auth.xml file:

<auth-configuration>
<credentials username="xyz" password="xyz">
<default realm="domain" />
<authscope host="www.gmail.com" port="80"/>
</credentials>
</auth-configuration>

ii) I defined the httpclient property (protocol-httpclient in plugin.includes) in both nutch-site.xml and nutch-default.xml:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

iii) I have also defined the auth configuration file in nutch-site.xml:

<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.
</description>
</property>

I am not able to crawl it, and I am getting no error.

Requirement: I want to crawl websites like gmail.com or yahoomail.com, or any site that asks for authentication.

Where am I going wrong? Am I choosing the wrong websites to crawl?

(If yes, please point me to websites that ask for authentication and I'll register for them.)

(If no, how can I crawl my Gmail or Facebook accounts?)

Recommended answer

A few points that will help you resolve this issue:

1) Yes, you have chosen the wrong websites to crawl and index. Try some different websites.

2) Nutch only supports NTLM, Basic, or Digest authentication. It does not support form-based authentication. The sites that you are trying to use have form-based authentication.

3) To implement form-based authentication, you will have to customize your Nutch code.
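To make point 2 concrete, here is a minimal Python sketch (not part of Nutch) of how one might tell the two cases apart from a server's response. It assumes the usual behaviour: a site using HTTP authentication answers an unauthenticated request with status 401 and a WWW-Authenticate header naming the scheme, while a site with a login form typically answers with a normal page or a redirect. The function name classify_auth and its return labels are hypothetical, chosen only for illustration.

```python
def classify_auth(status, headers):
    """Guess what kind of authentication a response is asking for.

    status  -- HTTP status code of an unauthenticated request
    headers -- dict mapping response header names to values
    (classify_auth is a hypothetical helper, not a Nutch API.)
    """
    # Header names are case-insensitive, so look the challenge up manually.
    challenge = ""
    for name, value in headers.items():
        if name.lower() == "www-authenticate":
            challenge = value.strip()
            break

    if status == 401 and challenge:
        # The first token of the challenge is the scheme name,
        # e.g. 'Basic realm="domain"' -> 'basic'.
        scheme = challenge.split()[0].lower()
        if scheme in ("basic", "digest", "ntlm"):
            return scheme              # schemes the answer says Nutch supports
        return "other-http-auth"       # some other HTTP-level scheme

    # No HTTP challenge: the site may be public or may use a login form.
    return "no-http-challenge"


if __name__ == "__main__":
    print(classify_auth(401, {"WWW-Authenticate": 'Basic realm="domain"'}))
    print(classify_auth(302, {"Location": "/login"}))
```

If a site falls into the last category, credentials in httpclient-auth.xml would never be requested by the server, which is consistent with a crawl that fetches nothing protected and reports no error.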

I am sure the following two links will help you make some progress on this issue:

http://technical-fundas.blogspot.in/2014/05/nutch-solr-formed-based-authentication.html

http://Technical-fundas.blogspot.in/2014/06/how-to-configure-nutch-in-eclipse-for.html
