Crawling public web sites - more difficult than Google Search Appliance?


Question

I've just installed Search Server 2008 and am a bit taken aback that it's not more intuitive when crawling web sites (or I'm missing something obvious!). I've set up other crawlers before like Google's Search Appliance and it's pretty much out-of-the-box.

I'm trying to crawl a classic ASP public website that requires cookies but am getting:

"The item could not be accessed on the remote server because its address has an invalid syntax."

What I *think* is happening is that the crawler is having trouble getting the cookie... but I can't find where to get a more detailed log.

I've set up a crawl rule:
- Include all items in this path, Crawl complex URLs
- Tried 'Use cookie for crawling', but it expects a form submission, which this site doesn't use. I've tried to mock one by entering data into another form field, but no dice: it says it has retrieved the cookie, but when I click OK it says the form credentials are incorrect.

My question is: how much config do I have to do for this classic ASP public website that requires cookies, and is there a guide somewhere to aid me?

Thanks, Dan

Recommended answer

What is it using the cookies for? This sounds a bit odd, and I think the cookies may be a red herring for some other issue.
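
One way to check whether the cookies really are the problem is to look at what the site sends to a client that has no cookie yet. Below is a minimal sketch, not anything Search Server provides: `probe` is a hypothetical helper name, and it uses only the Python standard library. It makes a single GET without following redirects and reports the status code, the `Location` header, and any `Set-Cookie` headers.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, cookie=None):
    """Make one GET request without following redirects and report the result.

    `cookie` is an optional raw Cookie header value, e.g. "NAME=value".
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)

    # Rebuild the request target from the path and query string.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    headers = {"Cookie": cookie} if cookie else {}
    try:
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return {
            "status": resp.status,
            "location": resp.getheader("Location"),
            "set_cookie": resp.headers.get_all("Set-Cookie") or [],
        }
    finally:
        conn.close()
```

Call `probe` on the crawl start address, then call it again passing the cookie from the first response. If the first call returns a redirect whose `Location` looks malformed or points back at the same page, that redirect (rather than the cookie itself) may be what the crawler is tripping over. That is only a guess to test, not a confirmed cause of the "invalid syntax" error.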

