Algorithm to calculate a page's importance based on its views / comments

Problem description

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.

For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.

The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:

GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082

Solution

You mentioned doing this in an SQL query, so I'll give samples in that.

If you have a table/view Pages, something like this

Pages
-----
page_id:int
views:int  - indexed
comments:int - indexed

Then you can order them by writing

SELECT * FROM Pages
ORDER BY
    -- weights are multiplied (0.3 for views, 0.7 for comments); highest score first
    0.3 * LOG10(10 + views) / LOG10(10 + (SELECT MAX(views) FROM Pages)) +
    0.7 * LOG10(10 + comments) / LOG10(10 + (SELECT MAX(comments) FROM Pages)) DESC

I've deliberately chosen unequal weighting between views and comments. A problem that can arise with equal weighting is that the ranking becomes a self-fulfilling prophecy: a page is returned at the top of the list, so it's visited more often and gets more points, so it stays at the top of the list, is visited even more often, and gets still more points. Putting more weight on the comments reflects that these take real effort and show real interest.
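As a rough illustration, the weighted formula could be written as a standalone function along the lines of the `GetPriority` call in the question. This is only a sketch: the name `get_priority` and the extra `max_views` / `max_comments` parameters are assumptions made here (the formula needs the site-wide maxima, which the SQL version obtains with `SELECT MAX(...)`).

```python
import math

# Weights taken from the formula above: comments count for more than views.
VIEW_WEIGHT = 0.3
COMMENT_WEIGHT = 0.7


def get_priority(views: int, comments: int, max_views: int, max_comments: int) -> float:
    """Weighted, log-scaled score in (0, 1]; hypothetical helper, not part of the answer.

    max_views and max_comments are the largest counts found on the site.
    Adding 10 inside the logarithm keeps both denominators at 1 or above.
    """
    view_score = math.log10(10 + views) / math.log10(10 + max_views)
    comment_score = math.log10(10 + comments) / math.log10(10 + max_comments)
    return VIEW_WEIGHT * view_score + COMMENT_WEIGHT * comment_score


if __name__ == "__main__":
    print(get_priority(100000, 100000, 100000, 100000))  # the busiest page scores 1.0
    print(get_priority(3, 2, 100000, 100000))            # a quiet page scores about 0.22
```

Note that, because of the log scaling, a page with almost no activity still lands well above 0 with this formula, so the 0.082 from the question's example would not be reproduced as-is.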

The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.

  0.3*(score for views/comments today) - live data
  0.3*(score for views/comments in the last week)
  0.25*(score for views/comments in the last month)
  0.15*(score for all views/comments, all time)

This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures so that the database doesn't have to repeatedly aggregate large volumes of comment/view stats. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
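The time-window blend described above might be sketched as follows. The window names and the `blended_score` function are hypothetical; only the four weights come from the answer, and each per-window score is assumed to have already been computed with the same views/comments formula restricted to that date range.

```python
from typing import Mapping

# Weights from the answer; they sum to 1.0, so the blend stays in the same range
# as the individual window scores.
WINDOW_WEIGHTS = {
    "today": 0.30,     # live data
    "week": 0.30,      # last week
    "month": 0.25,     # last month
    "all_time": 0.15,  # everything
}


def blended_score(window_scores: Mapping[str, float]) -> float:
    """Combine per-window scores into one score, favouring recent activity."""
    return sum(weight * window_scores[name] for name, weight in WINDOW_WEIGHTS.items())


if __name__ == "__main__":
    # Two pages with the same month and all-time scores; one is active right now.
    hot = {"today": 0.9, "week": 0.8, "month": 0.5, "all_time": 0.5}
    stale = {"today": 0.1, "week": 0.2, "month": 0.5, "all_time": 0.5}
    print(blended_score(hot) > blended_score(stale))  # the recently active page ranks higher
```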

EDIT: To get a strict range from 0.1 to 1.0, you would modify the formula like this. But I stress - this will only add overhead and is unnecessary - the absolute values of priority are not important - only their values relative to other URLs. The search engine uses these to answer the question: is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.

// unnormalized - x is some page id
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) +
        0.7*log(comments(x)+10)/log(10+maxComments())   // the original formula (now in pseudo code)

The maximum will be 1.0. The minimum starts at 1.0 (when no page has any views or comments yet) and moves downwards as more views/comments are made.

We define un(0) as the minimum value, i.e. the value of the above formula where views(x) and comments(x) are both 0.

To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x

                  (1.0-un(x)) * (un(0)-0.1)
  n(x) = un(x) -  -------------------------    when un(0) != 1.0
                          1.0-un(0)

       = 0.1 otherwise.
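A minimal sketch of this normalization, assuming the same 0.3/0.7 weights as above. The function names `unnormalized_minimum` and `normalize` are made up for illustration; the arithmetic follows the n(x) definition term by term.

```python
import math


def unnormalized_minimum(max_views: int, max_comments: int) -> float:
    """un(0): the unnormalized score of a page with 0 views and 0 comments.

    log10(0 + 10) is 1, so each numerator in un(x) reduces to its weight.
    """
    return 0.3 / math.log10(10 + max_views) + 0.7 / math.log10(10 + max_comments)


def normalize(un_x: float, un_0: float, floor: float = 0.1) -> float:
    """n(x): linearly rescale un(x) from the range [un(0), 1.0] to [floor, 1.0]."""
    if un_0 == 1.0:
        # No page has any activity yet, so every page gets the floor value.
        return floor
    return un_x - (1.0 - un_x) * (un_0 - floor) / (1.0 - un_0)


if __name__ == "__main__":
    un_0 = unnormalized_minimum(100000, 100000)
    print(normalize(un_0, un_0))  # a page with no activity maps to the floor
    print(normalize(1.0, un_0))   # the busiest page stays at 1.0
```

The rescaling is linear, so the relative order of pages is the same before and after normalization, which matches the point that only relative values matter.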
