Redshift: Update or Insert each row in column with random data from another table


Problem Description

-- The subquery is uncorrelated, so it is evaluated only once and the same
-- random value is written to every row.
update testdata.dataset1
   set abcd = (select abc
               from dataset2
               order by random()
               limit 1
              );

Doing this, only a single random entry from table dataset2 gets populated into all the rows of the dataset1 table.

What I need is for each row of dataset1 to be populated with its own random entry from dataset2.

Note: dataset1 can be larger than dataset2.
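
For reference, a minimal setup sketch matching the names used in this question (the column types and sample values are my assumptions, not from the original post):

CREATE TABLE dataset1 (id INT, abcd VARCHAR(32));
CREATE TABLE dataset2 (id INT, abc VARCHAR(32));

-- dataset1 is deliberately larger than dataset2, as the note above allows.
INSERT INTO dataset1 (id, abcd) VALUES (1, NULL), (2, NULL), (3, NULL), (4, NULL);
INSERT INTO dataset2 (id, abc) VALUES (1, 'foo'), (2, 'bar');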

Answer

Query 1

You should pass abcd into your subquery to prevent "optimizing": referencing the outer column makes the subquery correlated, so it is re-evaluated for every row instead of once.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                -- abcd resolves to the outer dataset1.abcd; this tautology
                -- correlates the subquery so it runs once per row
                WHERE abcd = abcd
                ORDER BY random()
                LIMIT 1
               );

SQL Fiddle
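
To sanity-check the result, a quick distribution query (my addition, not part of the original answer):

-- With the correlated subquery you should see several distinct abcd values;
-- the uncorrelated query from the question yields exactly one.
SELECT abcd, COUNT(*) AS cnt
FROM dataset1
GROUP BY abcd
ORDER BY cnt DESC;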

Query 2

The query below should be faster on plain PostgreSQL: skipping to a random OFFSET avoids sorting all of dataset2 for every row.

UPDATE dataset1
    SET abcd = (SELECT abc
                FROM dataset2
                -- correlate on the outer abcd, as in query 1
                WHERE abcd = abcd
                -- skip a random number of rows instead of sorting them all
                OFFSET floor(random()*(SELECT COUNT(*) FROM dataset2))
                LIMIT 1
               );

SQL Fiddle
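
The random pick itself can be tested standalone on plain PostgreSQL (a sketch; run it a few times and it should return different rows):

-- Picks one uniformly random row from dataset2 without an ORDER BY sort.
SELECT abc
FROM dataset2
OFFSET floor(random()*(SELECT COUNT(*) FROM dataset2))
LIMIT 1;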

However, as you have reported, this is not the case on Redshift, which uses columnar storage.

Query 3

Fetching all the records from dataset2 in a single query should be more efficient than fetching them one by one. Let's test:

UPDATE dataset1 original
SET abcd = fake.abc
FROM (SELECT ROW_NUMBER() OVER (ORDER BY random()) AS id, abc
      FROM dataset2) AS fake
WHERE original.id % (SELECT COUNT(*) FROM dataset2) = fake.id - 1;

SQL Fiddle

Note that the integer id column should exist in dataset1. Also, for dataset1.ids greater than the number of records in dataset2, the abcd values are predictable: all rows whose ids are congruent modulo the size of dataset2 receive the same abc value.
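
That predictability is easy to see (my illustration): the bucket a row lands in depends only on its id modulo the row count of dataset2.

-- Rows sharing a fake_bucket always end up with the same abc value,
-- e.g. with 2 rows in dataset2, ids 1 and 3 land in the same bucket.
SELECT id,
       id % (SELECT COUNT(*) FROM dataset2) AS fake_bucket
FROM dataset1
ORDER BY id;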

Query 4

Let's create an integer fake_id column in dataset1, prefill it with random values, and perform a join on dataset1.fake_id = dataset2.id.
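
First the column has to exist; the original answer assumes it does, so this one-time DDL step is my addition:

-- One-time schema change before the two updates below.
ALTER TABLE dataset1 ADD COLUMN fake_id INT;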

-- Assign each row a random id in the range 1..COUNT(*) of dataset2.
UPDATE dataset1
SET fake_id = floor(random()*(SELECT COUNT(*) FROM dataset2)) + 1;

-- Resolve the random ids to values with a single join.
UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;

SQL Fiddle
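
One caveat worth noting (my observation, not from the original answer): the join assumes dataset2.id is a gapless 1..COUNT(*) sequence. If the ids have gaps, a dense id can be derived on the fly:

-- Densify dataset2's ids with ROW_NUMBER so every fake_id finds a match.
UPDATE dataset1
SET abcd = dense.abc
FROM (SELECT ROW_NUMBER() OVER (ORDER BY id) AS dense_id, abc
      FROM dataset2) AS dense
WHERE dataset1.fake_id = dense.dense_id;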

Query 5

If you don't want to add a fake_id column to dataset1, let's calculate the fake_ids "on the fly":

UPDATE dataset1
SET abcd = abc
FROM (SELECT with_fake_id.id, dataset2.abc
      FROM (SELECT dataset1.id,
                   floor(RANDOM()*(SELECT COUNT(*) FROM dataset2) + 1) AS fake_id
            FROM dataset1) AS with_fake_id
      JOIN dataset2 ON with_fake_id.fake_id = dataset2.id) AS joined
WHERE dataset1.id = joined.id;

SQL Fiddle

Performance

On plain PostgreSQL, query 4 seems to be the most efficient. I'll try to compare performance on a trial DC1.Large instance.
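
One way to compare the candidates without timing full runs is to inspect the query plans; EXPLAIN accepts UPDATE statements on both PostgreSQL and Redshift (query 4's second step shown as an example):

-- Inspect the plan without executing the update.
EXPLAIN
UPDATE dataset1
SET abcd = abc
FROM dataset2
WHERE dataset1.fake_id = dataset2.id;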
