SQL大表中的随机行(带where子句) [英] SQL Random rows in a big table (with where clause)

查看:139
本文介绍了SQL大表中的随机行(带where子句)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个网站,人们可以在汽车上投票。

I have a website where people can vote on cars. 4 cars are showed to the user and he/she can vote on the car they like most.

cars 有重要的列:

car_id   int(10) (not auto_increment, so has gaps)
views    int(7)
points   int(7)
car_type int(1) (value = 1, 2 or 3)

此刻,我使用映射表为所有car_types,有一个PK没有间隙。我选择映射表的最大ID并创建4个随机数(PHP),从映射中选择那些行,并获取相应的car_id。这些数字用于从 cars 表中选择汽车。

At the moment I use a mapping table for all car_types, which has a PK which has no gaps. I select the max ID of the mapping table and create 4 random numbers (PHP), select those rows from mapping and get the corresponding car_id's. These numbers I use to select the cars from cars table.

问题后来添加到数据库的汽车与之前添加的汽车的机会相同。

The problem is that cars added later to the database have less chance to get on the same points as cars added earlier.

我的问题是如何显示具有相同点数(随机) )。还有重要注意事项:

My question is how to show 4 cars which have the same amount of points(random) ordered by the least number of views (views asc). Also the important notes:


  • 选择只应使用至少 1个点查询汽车。

  • 数据库将包含超过30M的汽车,它不是关于汽车,但对于问题,我认为更容易:

  • 当70%汽车有1分,20%有2分和10%有3分,比随机点应该选择汽车70%的时间与1分20%与2分和10%与3分。

  • 查询将用于向访问者显示4辆汽车,我们都知道用户不耐烦,所以查询速度越快越好:)

  • 我可以)使用映射表,这将在PK中没有间隙(如我现在)。

  • 将显示某个car_type内的汽车。例如,4种汽车类型2(这是家庭汽车)的随机,因为我不想同时显示跑车和家庭汽车。

  • The select should only query cars with at least 1 point.
  • The database will contain more than 30M cars, it's not about cars but for the question I think it's easier :).
  • When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.
  • The query will be used to show 4 cars to the visitor, we all know users are impatient so the faster the query the better :)
  • I could (if needed) use a mapping table, which will have no gaps in the PK (as I have now).
  • Cars only within a certain car_type will be showed. Example, 4 randoms of car type 2 (which is family cars), since I don't want to show sport cars and family cars at same time.

如果您知道另一个解决方案来解决上述问题,我愿意接受所有类型解决方案(PHP / SQL)。

If you know another solution to solve the problem above, I'm willing to accept all kind of solutions (PHP/SQL).

赏金因为这是一个比平均Stackoverflow问题更大的问题赏金将奖励给描述解决方案的人或(首选)解决方案的代码。

Bounty because it's a bigger question (/answer) than the average Stackoverflow question. Bounty will be rewarded to person which describes the solution or (preferred) the code of the solution. Anyways it's my way to thank the persons helping me and to ensure I appreciate your help greatly.

更新:

感谢所有的答案到目前为止!你的答案是正确的。我确实想了很多关于它最后几个小时,我开始意识到,数据库实际上从来没有这样的东西(显示随机数据),它是创建显示精确准确的数据。这就是为什么选择具有30M行或更多的PK仍然非常快。这就是为什么我正在考虑做所有的随机的东西在PHP。所以我在PHP中生成40个随机数,并从右侧车辆类型的映射表中选择40行。这个选择 IN 真的很快(如0.0006秒)。之后,我得到40 car_ids,我也选择与 IN 从cars表。我循环车,把它们放在一个数组,并做一些自定义排序(基于点和视图)。之后,我从40辆汽车中的所有点中选择一个随机数,并从最接近这个点数和最少视图的阵列中抓取汽车。这样PHP处理随机性和视图部分和查询,因为你要求精确的数据是非常快的(每个0.0006秒)。

Thanks for all the answers so far! You're all right in your answers. I did think a lot about it last few hours and I started to realize that databases were actually never build for things like this (showing random data), it was created to show precise and accurate data with fast access. That's why selects on PK's with 30M rows or more are still very fast. Thats why I'm thinking about doing all the random stuff in PHP. So I generate in PHP like 40 random numbers and select those 40 rows from the mapping table of the right car type. This select with IN is really fast (like 0.0006 seconds). After this select I got 40 car_ids which I also select with IN from the cars table. I loop the cars and put them in an array and do some custom sorts (based on points and views). After this I select a random number from all the points within the 40 cars and grab the cars from the array closest to this number of points and with the least views. This way PHP takes care of the randomness and the views part and the queries because you ask for precise data are very fast (0.0006 seconds each).

推荐答案

我想给出一个具体的答案,但我需要帮助来了解你的思考过程...

I'd love to give a specific answer, but I'd need help to understand your thought process...

/ p>

You start by writing:


我有一个网站,人们可以投票(...)他们最喜欢的车。

I have a website where people can vote on (...) the car they like most.

问题是,稍后添加到数据库的汽车与之前添加的汽车的机会相同。

The problem is that cars added later to the database have less chance to get on the same points as cars added earlier.

但你还是继续写作:


当70%的车有1分,20%有2分和10分%有3分,比随机点应该选择汽车70%的时间与1点20%与2分和10%与3分。

When 70% of the cars have 1 points, 20% have 2 points and 10% have 3 points, than the random points should select cars 70% of the time with 1 point 20% with 2 points and 10% with 3 points.

对我来说,后面的规格根据第一句话没有意义。

To me, the latter spec makes little sense in light of the first remark.

Imho,你真正想要的是用户有相同的数额的机会在每辆车上投票。

Imho, what you really want is for users to have the same amount of opportunities to vote on each car. Or more precisely, to vote for each car compared to each other car.

如果你假设(car)变量是独立的,那么你需要保持一个计数一个选择出现了多少次,而不是多少次投票,并相应地调整你的决策过程。这是一个数学问题,它不是那么丑陋,它可以被翻译成SQL为好或坏 - 我敢冒险,它可能会更糟。

If you assume that the (car) variables are independent, then you need to keep a count of how many times a choice came up, rather than how many times it got voted for, and adjust your decision process accordingly. It's a math problem, it's not so ugly, and it can then be translated into SQL for better or worse -- I'd venture that it'll probably be worse.

如果你假设,像我一样,他们不是独立的,你还需要考虑相关性,并存储他们相互了解多少次。因为,好吧,有一个无限苗条的机会,你不会喜欢这个梅赛德斯,而不是那个Tata,那个Xinkai或AvtoVAZ。但是考虑到同一辆梅赛德斯,宝马,保时捷和法拉利之间的选择,决定可能不是那么清楚。

If you assume, like I do, that they're not independent, you also need to account for correlations -- and store how many times they came up with each other too. Because, well, there's an infinitely slim chance that you won't prefer this Mercedes rather than that Tata, that Xinkai or that AvtoVAZ. But given a choice between the same Mercedes, a BMW, a Porsche and a Ferrari, the decision might not be so clearcut.

换句话说,你的规格不

我目前正在乞求同意两个小时前发布的答案:选择他们真的随机,你会满足没有额外的代码...

I'm currently begging to agree with the answer posted two hours ago: pick them really random, and you'll be satisfied without extra code...

注意,如果你的id真的没有空隙, php或什么,并使用在()语句中获取它们。你不会比这更有效率。

As a side note, if your ids really have no gaps, generate four ids in php or whatever and fetch them using an in() statement. You won't get more efficient than that.

这篇关于SQL大表中的随机行(带where子句)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆