Populate random data from another table
Question
update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I need to update column4 of dataset1 so that each row gets a random entry from column1 of dataset2. But with the query above, every row of dataset1 ends up with the same single random entry, whereas I want a different random value per row.
SETUP
Let's start by assuming your tables and data are the following ones.
Note that I assume that dataset1
has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series(1001, 1020) AS s(i) ;
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffling. Values from dataset2 will be used once, and no more than once.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a random fashion, we need some intermediate steps.
First, for dataset1, we need to create a list (relation) of tuples (id, rn), that are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present in dataset1. They can be of any type, they need not be consecutive, and they can be composite.
For dataset2, we need to create another list of (column1, rn), that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of (id, column1) tuples, where the pairing has been done randomly. We use these pairs to update dataset1.
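The pairing idea can be sketched outside SQL in a few lines of Python (the ids and values below are illustrative stand-ins for the two tables):

```python
import random

# Illustrative stand-ins for the two tables (names are hypothetical)
ids = [101, 102, 103, 104, 105]               # primary keys of dataset1
values = ["SOMETHING 1001", "SOMETHING 1002",
          "SOMETHING 1003", "SOMETHING 1004", "SOMETHING 1005"]

# Shuffle one side, then pair positionally; this is what joining
# ON original_keys.rn = shuffled_data.rn achieves in SQL
shuffled = values[:]
random.shuffle(shuffled)
pairs = list(zip(ids, shuffled))   # random (id, column1) pairings

# Every id is paired once and every value is used exactly once
assert sorted(v for _, v in pairs) == sorted(values)
```

Because one side is shuffled and then matched by position, the result is a complete shuffling: no value is dropped and none repeats.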
THE REAL QUERY
This can all be done (clearly, I hope) by using CTEs (a WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to the number of rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER
clause takes all the data and sorts it randomly.
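The whole technique can be exercised end to end with SQLite's in-memory database, which supports the same window functions (3.25+) and UPDATE ... FROM (3.33+, bundled with recent Python builds). This is a sketch of the PostgreSQL query above, not a drop-in replacement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dataset1 (id INTEGER PRIMARY KEY, column4 TEXT);
CREATE TABLE dataset2 (column1 TEXT);
""")
cur.executemany("INSERT INTO dataset1 VALUES (?, ?)",
                [(i, "column 4 for id %d" % i) for i in range(101, 121)])
cur.executemany("INSERT INTO dataset2 VALUES (?)",
                [("SOMETHING %d" % i,) for i in range(1001, 1021)])

# Same shape as the PostgreSQL query: shuffle dataset2 with
# row_number() OVER (ORDER BY random()), then join back on rn
cur.execute("""
WITH original_keys AS (
    SELECT id, row_number() OVER () AS rn FROM dataset1
),
shuffled_data AS (
    SELECT column1, row_number() OVER (ORDER BY random()) AS rn FROM dataset2
)
UPDATE dataset1
SET column4 = shuffled_data.column1
FROM shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE dataset1.id = original_keys.id
""")

cur.execute("SELECT count(DISTINCT column4) FROM dataset1")
print(cur.fetchone()[0])  # 20 -- every dataset2 value used exactly once
```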
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
SELECT * FROM dataset1 ;
 id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance on some occasions:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
 id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4
can be repeated several times.
The easiest possibility I can think of (probably, not an efficient one, but easy to understand) is to create a function random_column1
, marked as VOLATILE
:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2
might not be used at all, whereas others will be used more than once.
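The effect of calling a VOLATILE function per row is sampling with replacement, which can be sketched outside SQL like this (sizes and names are illustrative):

```python
import random

# Hypothetical sizes: more rows in dataset1 than values in dataset2
ids = list(range(101, 131))                                # 30 target rows
values = ["SOMETHING %d" % i for i in range(1001, 1021)]   # 20 source values

# Each row draws independently, i.e. sampling WITH replacement --
# the same effect the VOLATILE function gives inside the UPDATE
assignments = {row_id: random.choice(values) for row_id in ids}

# All 30 rows got a value; some dataset2 values repeat, others may go unused
assert len(assignments) == 30
assert all(v in values for v in assignments.values())
```

Marking the function VOLATILE is what forces one independent draw per row; a STABLE or IMMUTABLE function could legally be evaluated once and reused for every row, reproducing the original problem.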
dbfiddle here