Retrieve a list of the most popular GET param variations for a given URL?

Question

I'm working on building intelligence around link propagation. Because I have to deal with many short-URL services, where a reverse lookup requires the exact URL address, I need to be able to resolve multiple approximate versions of the same URL.

An example would be something like http://www.example.com?ref=affil&hl=en&ct=0

Of course, changing GET params in certain circumstances can refer to a completely different page, especially if the GET params in question refer to a profile or content ID.

But a quick parse of the pages would show how similar they are to each other. With a bit of machine learning, it could quickly become clear which GET params don't affect the content of the pages returned for a given site.
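As a rough illustration of how such probing could start, the sketch below builds one variant of a URL per GET parameter, each with that single parameter removed. Fetching every variant and comparing it with the original page would then hint at which parameters matter. The function name `drop_one_param_variants` is hypothetical, and the fetching step is deliberately left out.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def drop_one_param_variants(url):
    """Return {param_name: url_without_that_param} for each GET parameter.

    Hypothetical helper: it only rewrites the query string and does not
    fetch or compare any pages.
    """
    parts = urlsplit(url)
    # keep_blank_values=True preserves parameters such as "?debug="
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    variants = {}
    for name in {key for key, _ in pairs}:
        kept = [(key, value) for key, value in pairs if key != name]
        variants[name] = urlunsplit(parts._replace(query=urlencode(kept)))
    return variants


if __name__ == "__main__":
    url = "http://www.example.com/page?ref=affil&hl=en&ct=0"
    for name, variant in sorted(drop_one_param_variants(url).items()):
        print(name, "->", variant)
```

If the page returned for a variant is nearly identical to the original, the removed parameter is a candidate for being ignored when clustering URLs.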

I'm assuming a service to send a URL and get a list of very similar URLs could only be offered by the likes of Google or Yahoo (or Twitter), but they don't seem to offer this feature, and I haven't found any other services that do.

If you know of any services that do cluster together groups of almost identical URLs in the aforementioned way, please let me know.

A hug for your generosity.

Recommended answer

It sounds like you need to create some sort of discrete similarity rank between pages. This could be done by counting the words two pages have in common, normalizing that count to a bounded range, and then mapping portions of the range to different similarity ranks.
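A minimal sketch of that idea follows. It assumes Jaccard overlap of word sets as the normalization (one possible choice, giving a score between 0 and 1), and the rank thresholds of 0.5 and 0.9 are arbitrary placeholders rather than tuned values.

```python
import re


def word_set(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(text_a, text_b):
    """Jaccard overlap of the two word sets, always in [0.0, 1.0]."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)


def similarity_rank(score, thresholds=(0.5, 0.9)):
    """Map a score to a discrete rank.

    With the default (assumed) thresholds:
    0 = different, 1 = similar, 2 = near-identical.
    """
    return sum(score >= t for t in thresholds)


if __name__ == "__main__":
    page_a = "Welcome to the example shop, browse our catalogue"
    page_b = "Welcome to the example shop, browse our new catalogue"
    score = similarity(page_a, page_b)
    print(score, similarity_rank(score))
```

In practice the page text would first be extracted from the HTML, and the thresholds would be chosen by looking at pairs already known to be the same page.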

For each pair that you compare, you would also need to know which GET parameters the two URLs had in common, or how close their values were. This information would become the attributes that define each of your instances (stored alongside the rank mentioned above). After you have amassed a few hundred pairs of comparisons, you could perhaps do some feature subset selection to identify the GET parameters that best predict how similar two pages are.
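One way to turn a URL pair into such attributes is sketched below. It assumes the simplest encoding, a boolean per parameter that is true only when both URLs carry the same value; the function name `pair_attributes` is hypothetical, and a finer "how close" measure could replace the equality test.

```python
from urllib.parse import urlsplit, parse_qs


def pair_attributes(url_a, url_b):
    """Return {param_name: True if both URLs carry the same value(s)}.

    A parameter present in only one of the two URLs counts as different.
    """
    query_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    query_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    names = sorted(set(query_a) | set(query_b))
    return {name: query_a.get(name) == query_b.get(name) for name in names}


if __name__ == "__main__":
    attrs = pair_attributes(
        "http://www.example.com/?ref=affil&hl=en",
        "http://www.example.com/?ref=other&hl=en&ct=0",
    )
    print(attrs)
```

Each dictionary, together with the similarity rank of the two fetched pages, would form one training instance.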

Of course, this could end up not finding anything useful at all as this dataset is likely to contain a great deal of noise.

If you are interested in this approach, you should look into information gain (InfoGain) and feature subset selection in general. Here is a link to my professor's lecture notes, which may come in handy: http://stuff.ttoy.net/cs591o/FSS.html
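To make the idea concrete, here is a small sketch of information gain computed by hand for boolean attributes like "both URLs share this parameter's value". The data layout (a list of attribute dictionaries plus a parallel list of similarity ranks) is an assumption for illustration and is not taken from the lecture notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting
    the instances on the value of `attribute`."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Toy data: the rank follows "id" exactly and ignores "ref".
    rows = [
        {"ref": True, "id": True},
        {"ref": False, "id": True},
        {"ref": True, "id": False},
        {"ref": False, "id": False},
    ]
    ranks = [2, 2, 0, 0]
    for name in ("ref", "id"):
        print(name, info_gain(rows, ranks, name))
```

Parameters with a gain near zero, like `ref` in the toy data, are the ones that could probably be stripped when grouping URLs; with noisy real data the separation is unlikely to be this clean.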
