Populate random data from another table
Question
update dataset1.test
set column4 = (select column1
from dataset2
order by random()
limit 1
)
I need to update column4 of dataset1 so that each row gets a random entry from column1 of dataset2. But with the query above, every row of dataset1 ends up with the same single random entry, whereas I want a different random value per row.
SETUP
Let's start by assuming your tables and data are the following ones.
Note that I assume that dataset1
has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series(1001, 1020) AS s(i) ;
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffling. Values from dataset2 will be used once, and no more than once.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a random fashion, we need some intermediate steps.
First, for dataset1, we need to create a list (relation) of tuples (id, rn), that are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present in dataset1. They can be of any type, they need not be consecutive, and they can be composite.
For dataset2, we need to create another list of (column1, rn), that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of (id, column1) tuples, where the pairing has been done randomly. We use these pairs to update dataset1.
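The pairing idea can be sketched outside SQL in a few lines of Python (the ids and values below are illustrative stand-ins for the two tables):

```python
import random

# Illustrative stand-ins for the two tables (names are hypothetical)
ids = [101, 102, 103, 104, 105]               # primary keys of dataset1
values = ["SOMETHING 1001", "SOMETHING 1002",
          "SOMETHING 1003", "SOMETHING 1004", "SOMETHING 1005"]

# Shuffle one side, then pair positionally; this is what joining
# ON original_keys.rn = shuffled_data.rn achieves in SQL
shuffled = values[:]
random.shuffle(shuffled)
pairs = list(zip(ids, shuffled))   # random (id, column1) pairings

# Every id is paired once and every value is used exactly once
assert sorted(v for _, v in pairs) == sorted(values)
```

Because one side is shuffled and then matched by position, the result is a complete shuffling: no value is dropped and none repeats.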
THE REAL QUERY
This can all be done (clearly, I hope) by using CTEs (a WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to the number of rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER
clause takes all the data and sorts it randomly.
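The whole technique can be exercised end to end with SQLite's in-memory database, which supports the same window functions (3.25+) and UPDATE ... FROM (3.33+, bundled with recent Python builds). This is a sketch of the PostgreSQL query above, not a drop-in replacement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dataset1 (id INTEGER PRIMARY KEY, column4 TEXT);
CREATE TABLE dataset2 (column1 TEXT);
""")
cur.executemany("INSERT INTO dataset1 VALUES (?, ?)",
                [(i, "column 4 for id %d" % i) for i in range(101, 121)])
cur.executemany("INSERT INTO dataset2 VALUES (?)",
                [("SOMETHING %d" % i,) for i in range(1001, 1021)])

# Same shape as the PostgreSQL query: shuffle dataset2 with
# row_number() OVER (ORDER BY random()), then join back on rn
cur.execute("""
WITH original_keys AS (
    SELECT id, row_number() OVER () AS rn FROM dataset1
),
shuffled_data AS (
    SELECT column1, row_number() OVER (ORDER BY random()) AS rn FROM dataset2
)
UPDATE dataset1
SET column4 = shuffled_data.column1
FROM shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE dataset1.id = original_keys.id
""")

cur.execute("SELECT count(DISTINCT column4) FROM dataset1")
print(cur.fetchone()[0])  # 20 -- every dataset2 value used exactly once
```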
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
|    20 |
SELECT * FROM dataset1 ;
 id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance on some occasions:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
 id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4
can be repeated several times.
The easiest possibility I can think of (probably, not an efficient one, but easy to understand) is to create a function random_column1
, marked as VOLATILE
:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2
might not be used at all, whereas others will be used more than once.
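The effect of calling a VOLATILE function per row is sampling with replacement, which can be sketched outside SQL like this (sizes and names are illustrative):

```python
import random

# Hypothetical sizes: more rows in dataset1 than values in dataset2
ids = list(range(101, 131))                                # 30 target rows
values = ["SOMETHING %d" % i for i in range(1001, 1021)]   # 20 source values

# Each row draws independently, i.e. sampling WITH replacement --
# the same effect the VOLATILE function gives inside the UPDATE
assignments = {row_id: random.choice(values) for row_id in ids}

# All 30 rows got a value; some dataset2 values repeat, others may go unused
assert len(assignments) == 30
assert all(v in values for v in assignments.values())
```

Marking the function VOLATILE is what forces one independent draw per row; a STABLE or IMMUTABLE function could legally be evaluated once and reused for every row, reproducing the original problem.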
dbfiddle here