随机化大型数据集 [英] randomizing large dataset

查看:107
本文介绍了随机化大型数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试找到一种从大型数据集中获取随机选择的方法.

I am trying to find a way to get a random selection from a large dataset.

我们希望数据集可以增长到约50万条记录,因此找到一种在数据集增长时保持良好性能的方法非常重要.

We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.

我尝试了以下技术: http://forums .mysql.com/read.php?24,163940,262235#msg-262235 但是它不是完全随机的,它不能很好地与LIMIT子句一起使用,您不一定总能得到记录所需的内容.

I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235 But it's not exactly random and it doesn't play well with a LIMIT clause, you don't always get the number of records that you want.

所以我想,由于PK是auto_increment,所以我只是生成一个随机ID的列表,并使用IN子句选择所需的行.这种方法的问题在于,有时我需要一个随机的数据集,这些记录的记录具有特殊状态,该状态最多占总数的5%.要进行这项工作,我首先需要找出具有该特定状态的我可以使用的ID,所以这也将不起作用.

So I thought, since the PK is auto_increment, I just generate a list of random id's and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of data with records having a spefic status, a status that is found in at most 5% of the total set. To make that work I would first need to find out what ID's I can use that have that specific status, so that's not going to work either.

我正在使用MyISAM存储引擎mysql 5.1.46.
重要的是要知道选择随机行的查询将非常频繁地运行,并且从中选择的表会频繁地追加.

I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.

任何帮助将不胜感激!

推荐答案

您可以通过一些非规范化来解决此问题:

You could solve this with some denormalization:

  • 构建包含与数据表相同的pkey和状态的辅助表
  • 添加并填充状态组列,这将是您自己自动编号的一种子pkey(相对于单个状态,基于1的自动增量)
Pkey    Status    StatusPkey
1       A         1
2       A         2
3       B         1
4       B         2
5       C         1
...     C         ...
n       C         m (where m = # of C statuses)

当您不需要过滤时,可以如上所述在pkey上生成rand#.当您确实需要过滤时,然后针对您感兴趣的特定状态的StatusPkey生成兰德.

When you don't need to filter you can generate rand #s on the pkey as you mentioned above. When you do need to filter then generate rands against the StatusPkeys of the particular status you're interested in.

有几种方法可以建立该表.您可以有一个间隔运行的过程,也可以实时运行.尽管后者会降低性能,因为计算StatusPkey可能会变得很昂贵.

There are several ways to build this table. You could have a procedure that you run on an interval or you could do it live. The latter would be a performance hit though since the calculating the StatusPkey could get expensive.

这篇关于随机化大型数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆