Biased random in SQL?

Question

I have some entries in my database, in my case videos, each with a rating, a popularity score, and other factors. From all of these factors I calculate a likelihood factor, or rather a boost factor.

So I essentially have the fields ID and BOOST. The boost is calculated so that it comes out as an integer representing how often this entry should be hit relative to the others.

ID  Boost
1   1
2   2
3   7

So if I run my random function indefinitely, I should end up with X hits on ID 1, twice as many on ID 2, and 7 times as many on ID 3.

So every hit should be random, but with a probability of (boost / sum of boosts). The probability for ID 3 in this example should therefore be 0.7 (because the sum is 10; I chose those values for simplicity).
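As a minimal sketch of the target distribution described here, the following Python snippet computes boost / sum of boosts for the example table. The function name `target_probabilities` and the dict layout (id mapped to boost) are illustrative assumptions, not part of the original question.

```python
from fractions import Fraction


def target_probabilities(boosts):
    """Map each id to boost / sum(boosts) as an exact fraction."""
    total = sum(boosts.values())
    return {row_id: Fraction(boost, total) for row_id, boost in boosts.items()}


# Example table from the question: id -> boost.
probs = target_probabilities({1: 1, 2: 2, 3: 7})
for row_id, p in probs.items():
    print(row_id, float(p))
```

Any sampling scheme for this problem can be checked against these values.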

I thought about something like the following query:

SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();

Unfortunately that doesn't work. Consider the following entries in the table:

ID  Boost
1   1
2   2

With a 50/50 chance, it will have either only the 2nd element or both elements to choose from randomly.

So 0.5 of the hits go to the second element, and 0.5 of the hits go to the pair (second and first), which is chosen from randomly, so 0.25 each. We end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
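The arithmetic above can be reproduced exactly. The sketch below models the scheme as described in the text, under these assumptions: a threshold is drawn uniformly from 1..max(boost), rows whose boost is at least the threshold survive, and one survivor is picked uniformly. The function name `filter_then_pick` is hypothetical.

```python
from fractions import Fraction


def filter_then_pick(boosts):
    """Exact hit probabilities for: draw a threshold t uniformly from
    1..max(boost), keep rows with boost >= t, pick one survivor uniformly."""
    top = max(boosts.values())
    probs = {row_id: Fraction(0) for row_id in boosts}
    for t in range(1, top + 1):
        survivors = [row_id for row_id, b in boosts.items() if b >= t]
        for row_id in survivors:
            # Each threshold has probability 1/top; survivors share it equally.
            probs[row_id] += Fraction(1, top) * Fraction(1, len(survivors))
    return probs


got = filter_then_pick({1: 1, 2: 2})
for row_id, p in got.items():
    print(row_id, p)
```

For the two-row table this gives 1/4 and 3/4 rather than the desired 1/3 and 2/3, which matches the reasoning in the question.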

I need some modification or a new method to do this with good performance.

I also thought about storing the boost field cumulatively so that I just do a range query from (0-sum()), but then I would have to re-index everything coming after an item if I change it, or develop some swapping algorithm or something... and that's really not elegant.
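To make the cumulative idea concrete, here is a small Python sketch of it outside the database. It assumes the rows fit in memory as (id, boost) pairs; `build_cumulative` and `pick` are illustrative names, and the range query is stood in for by a binary search over the running totals.

```python
import bisect
import itertools
import random


def build_cumulative(rows):
    """rows: list of (id, boost). Returns (ids, running totals of boost)."""
    ids = [row_id for row_id, _ in rows]
    totals = list(itertools.accumulate(boost for _, boost in rows))
    return ids, totals


def pick(ids, totals, rng=random):
    """Draw r in 1..sum(boost); return the first id whose running total >= r."""
    r = rng.randint(1, totals[-1])
    return ids[bisect.bisect_left(totals, r)]


ids, totals = build_cumulative([(1, 1), (2, 2), (3, 7)])
print(totals)
print(pick(ids, totals))
```

The drawback mentioned above is visible here: changing one boost changes every running total after it, so the stored column would need to be rewritten from that row onward.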

Both inserting/updating and selecting should be fast!

Do you have any solutions to this problem?

The best use case to think of is probably advertisement delivery: "Please choose a random ad with a given probability"... I need it for another purpose, though; this is just to give you a final picture of what it should do.

Thanks to kens answer, I thought about the following approach:

  1. Calculate a random value from 0 to sum(distinct boost):

SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);

  2. Select the boost factor from all distinct boost factors whose running total surpasses the random value.

Then in our 1st example we have 1 with a probability of 0.1, 2 with 0.2, and 7 with 0.7.

  3. Now select a random entry from all entries having this boost factor.

PROBLEM: the number of entries sharing a given boost always differs. For example, if there is only one entry with boost 1, I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned... so this doesn't work out :( Trying to refine it.

I have to somehow include the count of entries with this boost factor... but I am stuck on that.
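One possible way to include the count, sketched in Python rather than SQL: weight each distinct boost value by boost × (number of entries with that boost), then pick uniformly inside the chosen group. Each entry then ends up with probability boost / sum of all boosts. This is an illustrative assumption about how the refinement could go, not something stated in the question; `pick_by_group` is a hypothetical name.

```python
import random
from collections import defaultdict


def pick_by_group(rows, rng=random):
    """rows: list of (id, boost). Pick a boost group with weight
    boost * count, then pick uniformly inside that group."""
    groups = defaultdict(list)
    for row_id, boost in rows:
        groups[boost].append(row_id)
    boosts = list(groups)
    # Weighting by boost * count cancels the 1/count of the uniform pick below.
    weights = [b * len(groups[b]) for b in boosts]
    chosen = rng.choices(boosts, weights=weights, k=1)[0]
    return rng.choice(groups[chosen])


rows = [(1, 1), (2, 1), (3, 2)]
print(pick_by_group(rows))
```

With the three rows above the total boost is 4, so id 3 should be returned about half of the time and ids 1 and 2 about a quarter each.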

Recommended answer

You need to generate a random number per row and weight it.

In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost.

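-- Sample data from the question: id and its boost.
DECLARE @sample TABLE (id int, boost int)

INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    -- NEWID() differs per row, so RAND is reseeded for every row.
    RAND(CHECKSUM(NEWID())) * boost  AS weighted,
    -- Total boost across all rows.
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC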
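It may be worth checking how often each row actually comes out on top with this ordering. The Python sketch below simulates the "uniform random number times boost" key for the two-row table from the question (boosts 1 and 2) and, for comparison, a different technique that is not part of this answer: the Efraimidis-Spirakis key u ** (1 / boost). All function names here are hypothetical, and the simulation only approximates the SQL behaviour under the assumption that RAND(CHECKSUM(NEWID())) acts like an independent uniform draw per row.

```python
import random


def top_by_scaled_uniform(boosts, rng):
    """Winner when each row is keyed by u * boost (the ordering used above)."""
    return max(boosts, key=lambda row_id: rng.random() * boosts[row_id])


def top_by_power_key(boosts, rng):
    """Winner when each row is keyed by u ** (1 / boost) (Efraimidis-Spirakis)."""
    return max(boosts, key=lambda row_id: rng.random() ** (1.0 / boosts[row_id]))


def hit_rates(picker, boosts, trials=100_000, seed=0):
    """Fraction of trials in which each id is the top-ranked row."""
    rng = random.Random(seed)
    counts = dict.fromkeys(boosts, 0)
    for _ in range(trials):
        counts[picker(boosts, rng)] += 1
    return {row_id: n / trials for row_id, n in counts.items()}


boosts = {1: 1, 2: 2}
scaled = hit_rates(top_by_scaled_uniform, boosts)
power = hit_rates(top_by_power_key, boosts)
print("u * boost:       ", scaled)
print("u ** (1 / boost):", power)
```

For boosts 1 and 2, the u * boost key puts id 1 on top with probability 1/4 (the chance that one uniform draw exceeds twice another), while the power key gives 1/3, which is boost / sum of boosts. If exact proportions matter, ordering by something like POWER(u, 1.0 / boost) instead of u * boost may be worth trying.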

If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.

Finally, ORDER BY NEWID() gives a random order that takes no account of boost. NEWID() is useful for seeding RAND, but not on its own.

This sample was put together on SQL Server 2008, BTW.
