How to provide most relevant results with Multiple Factor Weighted Sorting

Problem description

I need to provide a weighted sort on 2+ factors, ordered by "relevancy". However, the factors aren't completely isolated, in that I want one or more of the factors to affect the "urgency" (weight) of the others.

Example: contributed content (articles) can be up-/down-voted, and thus have a rating; they have a post date, and they're also tagged with categories. Users write the articles and can vote, and may or may not have some kind of ranking themselves (expert, etc). Probably similar to StackOverflow, right?

I want to provide each user with a list of articles grouped by tag but sorted by "relevancy", where relevancy is calculated based on the rating and age of the article, and possibly affected by the ranking of the author. For instance, a highly ranked article that was written several years ago may not necessarily be as relevant as a medium ranked article written yesterday. And maybe if an article was written by an expert it would be treated as more relevant than one written by "Joe Schmoe".

Another good example is assigning hotels a "meta score" made up of price, rating, and attractions (http://stackoverflow.com/questions/8661118/need-help-maximizing-3-factors-in-multiple-similar-objects-and-ordering-appropr/8759877#8759877).

My question is, what is the best algorithm for multiple factor sorting? This may be a duplicate of that question, but I'm interested in a generic algorithm for any number of factors (a more reasonable expectation is 2 - 4 factors), preferably a "fully-automatic" function that I don't have to tweak or require user input, and I can't parse linear algebra and eigenvector wackiness.

Possibilities I've found so far:

Note: S is the "sorting score"

  1. "Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax). I think this is roughly how Lucene search works (http://stackoverflow.com/questions/817998/how-to-sort-search-results-on-multiple-fields-using-a-weighting-function).
  2. "Base-N weighted" - more like grouping than weighting. It's just a linear weighting where the weights are successive powers of 10 (a similar principle to CSS selector specificity), so that more important factors count significantly more: S = 1000 * F1 + 100 * F2 + 10 * F3 ...
  3. Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor, the consequence being that you sort on more "statistically significant" values. The original article explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article) and F2 is the "significance modifying" factor ("visits" in the article).
  4. Bayesian Estimate - looks really similar to ETV; this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg, where Fx are the same as in #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
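Options #1 and #2 can be sketched in a few lines of Python. This is only an illustration of the two formulas as written above; the function names (`normalize`, `linear_score`, `base_n_score`) and the sample numbers are made up for the example, and the handling of an all-zero factor is an assumption.

```python
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale every value by the largest one, i.e. Fx_n = Fx / Fmax."""
    peak = max(values)
    if peak == 0:
        # Assumption: an all-zero factor normalizes to all zeros.
        return [0.0 for _ in values]
    return [v / peak for v in values]


def linear_score(factors: Sequence[float], weights: Sequence[float]) -> float:
    """Option #1: S = (w1 * F1) + (w2 * F2) + ... on normalized factors."""
    if len(factors) != len(weights):
        raise ValueError("need exactly one weight per factor")
    return sum(w * f for w, f in zip(weights, factors))


def base_n_score(factors: Sequence[float], base: float = 10.0) -> float:
    """Option #2: S = base^n * F1 + ... + base * Fn (most important first)."""
    n = len(factors)
    return sum(f * base ** (n - i) for i, f in enumerate(factors))


if __name__ == "__main__":
    # Hypothetical articles: raw rating and raw "freshness" (higher is newer).
    ratings = normalize([40, 10, 25])
    freshness = normalize([1, 30, 12])
    weights = [2.0, 1.0]  # arbitrary, which is exactly the drawback of option #1
    scores = [linear_score(pair, weights) for pair in zip(ratings, freshness)]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    print(ranked, scores)
```

Both functions still depend on hand-picked weights (or a hand-picked factor order), which is the limitation the question is trying to avoid.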

Options #3 or #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but the problem is how do you do this for more than two factors?
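For reference, the two-factor forms of options #3 and #4 might look like this in Python. It is a direct transcription of the two equations above and does not address the more-than-two-factors case; the function and parameter names are invented for the sketch, and the guards against zero denominators are an assumption.

```python
def etv_score(f1: float, f2: float, f2_max: float, f1_avg: float) -> float:
    """Option #3: S = (F2 / F2_max * F1) + ((1 - F2 / F2_max) * F1_avg)."""
    if f2_max <= 0:
        raise ValueError("f2_max must be positive")
    confidence = f2 / f2_max  # 1.0 for the most "significant" item
    return confidence * f1 + (1 - confidence) * f1_avg


def bayesian_score(f1: float, f2: float, f2_lim: float, f1_avg: float) -> float:
    """Option #4: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg."""
    total = f2 + f2_lim
    if total <= 0:
        raise ValueError("f2 + f2_lim must be positive")
    return (f2 / total) * f1 + (f2_lim / total) * f1_avg


if __name__ == "__main__":
    # Hypothetical numbers: f1 is an article rating, f2 is its vote count.
    print(etv_score(f1=4.0, f2=50, f2_max=100, f1_avg=3.0))
    print(bayesian_score(f1=4.0, f2=100, f2_lim=100, f1_avg=3.0))
```

In both cases an item with little supporting data (small F2) is pulled toward the average F1_avg, while an item with a lot of data keeps a score close to its own F1.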

I also came across the SQL implementation for a two-factor weighting algorithm, which is basically what I'll need to write eventually.

Recommended answer

As mentioned in the comments, I would suggest what's called the 'compromise solution' to anyone with a similar problem who is more concerned with not having to set weights than with making one criterion more heavily weighted than the others.

Basically, you treat each of your criteria as a coordinate (after normalization, of course). Based on your judgement, you choose the absolute optimal point, e.g. in this case the highest-ranked author, the newest article, etc. Once you choose the optimal point, every other 'solution' is rated by its distance from that optimum. A sample formula would be the inverse of the Euclidean distance for each article's score: S = 1 / sqrt((rank - rank_ideal)^2 + (age - age_ideal)^2 + ... + (xn - xn_ideal)^2).

This treats all criteria as equal, so keep that in mind.
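A minimal Python sketch of that inverse-distance score, assuming every criterion has already been normalized to the same scale. The name `compromise_score`, the sample articles, and the choice to return infinity for an item sitting exactly on the ideal point are assumptions made for the example; the formula itself leaves that case undefined.

```python
import math
from typing import Sequence


def compromise_score(point: Sequence[float], ideal: Sequence[float]) -> float:
    """S = 1 / sqrt(sum((x_i - x_i_ideal)^2)) over all criteria."""
    if len(point) != len(ideal):
        raise ValueError("point and ideal must have the same number of criteria")
    distance = math.sqrt(sum((x - x_ideal) ** 2 for x, x_ideal in zip(point, ideal)))
    # Assumption: an item exactly at the ideal point gets the best possible score.
    return math.inf if distance == 0 else 1.0 / distance


if __name__ == "__main__":
    # Hypothetical articles as (rating, recency, author rank), each in [0, 1].
    ideal = (1.0, 1.0, 1.0)
    articles = {
        "old classic": (0.9, 0.1, 0.8),
        "fresh average": (0.5, 1.0, 0.4),
        "expert recent": (0.7, 0.8, 1.0),
    }
    ranked = sorted(
        articles,
        key=lambda name: compromise_score(articles[name], ideal),
        reverse=True,
    )
    print(ranked)
```

Because the function takes any number of coordinates, the same code covers two, three, or four factors without choosing weights, at the cost of treating every factor as equally important.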
