Should a web crawler pick up queries?

Question

Over the last few days I have been coding a web crawler. The only question I have left is this: do "standard" web crawlers crawl links that carry a query string, like this one:
https://www.google.se/?q=stackoverflow
or do they skip the query and pick the link up like this:
https://www.google.se

Recommended answer

In case you are referring to crawling in order to build some sort of index of web resources:

The full answer is very long, but in short my opinion is this: if the "page/resource" https://www.google.se/?q=stackoverflow is pointed to by many other pages (i.e. it has a large in-link degree), then leaving it out of your index might mean that you miss a very important node in the web graph. On the other hand, imagine how many links of the form google.com/?q="query" there are on the web. Probably a huge number, so crawling all of them would certainly be a huge overhead for your crawler/indexer system.
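One way to picture this trade-off is the minimal Python sketch below. It is only an illustration under stated assumptions, not a description of how any particular crawler works: the function names `strip_query` and `select_urls`, the `(source_page, target_url)` input format, and the `min_in_links` threshold are all hypothetical. A query URL is kept as-is only when enough distinct pages link to it; otherwise it is reduced to the same URL without the query string.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def select_urls(discovered_links, min_in_links=3):
    """Choose which form of each discovered URL to enqueue.

    discovered_links: iterable of (source_page, target_url) pairs.
    min_in_links: hypothetical threshold, an assumption of this sketch;
        a query URL linked from fewer distinct pages is stripped.
    """
    # Collect the distinct source pages that point at each target URL.
    sources = {}
    for source, target in discovered_links:
        sources.setdefault(target, set()).add(source)

    selected = set()
    for target, pages in sources.items():
        has_query = bool(urlsplit(target).query)
        if has_query and len(pages) < min_in_links:
            # Rarely linked query URL: fall back to the query-free form.
            selected.add(strip_query(target))
        else:
            # No query at all, or a query URL with many in-links: keep it.
            selected.add(target)
    return selected


if __name__ == "__main__":
    links = [
        ("page-a", "https://www.google.se/?q=stackoverflow"),
        ("page-b", "https://www.google.se/?q=stackoverflow"),
        ("page-c", "https://www.google.se/?q=stackoverflow"),
        ("page-a", "https://example.com/search?q=rare"),
    ]
    for url in sorted(select_urls(links)):
        print(url)
```

In practice in-link counts are only known after a good part of the graph has been crawled, so a real system would probably have to revisit this decision as more links are discovered; the threshold of 3 is arbitrary.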
