How do websites like torrentz.eu collect their content?


Question

I would like to know how some search websites get their content. I used torrentz.eu as the example in the title because it aggregates content from several sources. I would like to know what is behind this system: do they 'simply' parse all the websites they support and then show the content? Do they use some web service? Or both?

Answer

You are looking for the crawling aspect of Information Retrieval (see the Wikipedia article Web_crawler).

Basically, crawling is: given an initial set S of websites, try to expand it by exploring the links, i.e., find the transitive closure(1), as in the sketch below.
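As a rough illustration, here is a minimal breadth-first crawler in Python using only the standard library. The max_pages cap and the timeout are illustrative assumptions; a real crawler would additionally respect robots.txt, canonicalize URLs, throttle requests, and persist an index.

```python
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns the URLs visited.

    max_pages bounds the crawl, since the full transitive closure
    is rarely computable in practice (see footnote (1) below).
    """
    seen = set(seeds)
    frontier = deque(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable, non-HTML, or malformed page: skip it
        extractor = LinkExtractor()
        extractor.feed(page)
        for href in extractor.links:
            absolute = urllib.parse.urljoin(url, href)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited
```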

Some websites also use focused crawlers, if they aim to index only a subset of the web in the first place; see the sketch below.
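A focused crawler can be sketched as the same loop with a relevance test applied to the frontier. Here the test is a hypothetical domain whitelist; ALLOWED_HOSTS is made up purely for illustration, and real focused crawlers often score page content rather than just hostnames.

```python
import urllib.parse

ALLOWED_HOSTS = {"tracker-one.example", "tracker-two.example"}  # hypothetical


def is_relevant(url):
    """Keep a link on the frontier only if it points at a site we index."""
    host = urllib.parse.urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

# Wiring it into the crawl() sketch above means guarding the frontier:
#     if is_relevant(absolute) and absolute not in seen: ...
```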

P.S. Some websites do neither, and instead use a service such as the Google Custom Search API, Yahoo BOSS, or the Bing Developer APIs (for a fee, of course), relying on that provider's index rather than building one of their own.
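For example, a hedged sketch of querying the Google Custom Search JSON API; the API key and search-engine ID are placeholders you would obtain from Google, and the parsing assumes the API's documented JSON response with an "items" list.

```python
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"    # placeholder: issued in the Google Cloud console
CX = "YOUR_ENGINE_ID"       # placeholder: the custom search engine ID


def google_search(query, num=10):
    """Return result URLs for a query via the Custom Search JSON API."""
    params = urllib.parse.urlencode(
        {"key": API_KEY, "cx": CX, "q": query, "num": num}
    )
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [item["link"] for item in data.get("items", [])]
```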

P.P.S. This is a theoretical approach to how one could do it; I have no idea how the website mentioned in the question actually works.

(1) Due to time constraints, the transitive closure is usually not actually found; something close enough to it is computed instead.

