Should a web crawler pick up queries?
Question
I have spent the last few days writing a web crawler. The only question I have left is whether "standard" web crawlers follow links that carry a query string, like this one:
https://www.google.se/?q=stackoverflow
or whether they drop the query and fetch the link like this:
https://www.google.se
Recommended answer
In case you are referring to crawling for some sort of indexing of web resources:
The full answer is long, but in short, my opinion is this. Suppose the page https://www.google.se/?q=stackoverflow is pointed to by many other pages (that is, it has a large in-link degree). Leaving it out of your index could then mean missing a very important node in the web graph. On the other hand, imagine how many links of the form google.com/?q="query" exist on the web. The number is probably huge, so crawling all of them would certainly be a large overhead for your crawler and indexer.
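One way to act on this trade-off is to always crawl query-less URLs, but crawl a URL with a query string only after enough distinct pages have linked to it. The Python sketch below illustrates that idea; it is not how any particular search engine works. The names `canonicalize` and `QueryAwareFrontier` and the `min_inlinks` threshold are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url):
    """Return a normalized form of the URL so equivalent links compare equal."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 map to the same key.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    path = parts.path or "/"
    # Lowercase scheme and host; drop the fragment, which never reaches the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(params), "")
    )


class QueryAwareFrontier:
    """Hypothetical frontier: query URLs must earn a minimum in-link degree."""

    def __init__(self, min_inlinks=3):
        self.min_inlinks = min_inlinks      # assumed threshold, tune per crawl
        self.inlinks = defaultdict(set)     # canonical target -> referring pages
        self.scheduled = set()              # canonical URLs already handed out

    def observe_link(self, source_url, target_url):
        """Record a link; return the canonical URL to crawl now, or None."""
        target = canonicalize(target_url)
        self.inlinks[target].add(canonicalize(source_url))
        if target in self.scheduled:
            return None
        has_query = bool(urlsplit(target).query)
        if has_query and len(self.inlinks[target]) < self.min_inlinks:
            return None  # not enough distinct referrers yet
        self.scheduled.add(target)
        return target


frontier = QueryAwareFrontier(min_inlinks=2)
print(frontier.observe_link("https://a.example/", "https://www.google.se"))
print(frontier.observe_link("https://a.example/", "https://www.google.se/?q=stackoverflow"))
print(frontier.observe_link("https://b.example/", "https://www.google.se/?q=stackoverflow"))
```

With `min_inlinks=2`, the plain https://www.google.se link is scheduled at once, while the `?q=stackoverflow` variant is held back until a second distinct page links to it. A real crawler would also want to respect robots.txt, which many sites use to exclude their search-result URLs.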