如何使用另一个表中的随机行更新表的每一行 [英] How to update each row of a table with a random row from another table

查看:48
本文介绍了如何使用另一个表中的随机行更新表的每一行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在构建我的第一个去标识化脚本,但我的方法遇到了问题.

I'm building my first de-identification script, and running into issues with my approach.

我有一个表 dbo.pseudonyms,它的 firstname 列填充了 200 行数据.这列 200 行中的每一行都有一个值(没有一个为空).该表还有一个 id 列(整数,主键,非空),其中的数字为 1-200.

I have a table dbo.pseudonyms whose firstname column is populated with 200 rows of data. Every row in this column of 200 rows has a value (none are null). This table also has an id column (int, primary key, not null) with the numbers 1-200.

我想要做的是,在一个语句中,重新填充我的整个 USERS 表,其中 firstname 数据从我的 化名 表格.

What I want to do is, in one statement, re-populate my entire USERS table with firstname data randomly selected for each row from my pseudonyms table.

要生成用于挑选的随机数,我使用了 ABS(Checksum(NewId())) % 200.每次我执行 SELECT ABS(Checksum(NewId())) % 200 时,我都会得到一个我正在寻找的范围内的数值,没有间歇性的不稳定行为.

To generate the random number for picking I'm using ABS(Checksum(NewId())) % 200. Every time I do SELECT ABS(Checksum(NewId())) % 200 I get a numeric value in the range I'm looking for just fine, no intermittently erratic behavior.

但是,当我在以下语句中使用此公式时:

HOWEVER, when I use this formula in the following statement:

SELECT pn.firstname 
FROM DeIdentificationData.dbo.pseudonyms pn 
WHERE pn.id = ABS(Checksum(NewId())) % 200

我得到的结果非常断断续续.我会说大约 30% 的结果返回一个从表中挑选出来的名字(这是预期的结果),大约 30% 返回多个结果(这令人困惑,有没有重复的 id 列值),并且大约 30% 返回 NULL(即使 firstname 列中有 没有 空行)

I get VERY intermittent results. I'd say about 30% of the results return one name picked out of the table (this is the expected result), about 30% come back with more than one result (which is baffling, there are no duplicate id column values), and about 30% come back with NULL (even though there are no empty rows in the firstname column)

我确实为这个特定问题寻找了很长时间,但到目前为止都无济于事.我假设这个问题与使用这个公式作为指针有关,但我不知道如何做到这一点.

I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.

想法?

推荐答案

为什么问题中的查询返回意外结果

您的原始查询选自 Pseudonyms.服务器扫描表格的每一行,从该行中选取 ID,生成一个随机数,将生成的数字与 ID 进行比较.

Your original query selects from Pseudonyms. Server scans through each row of the table, picks the ID from that row, generates a random number, compares the generated number to the ID.

当特定行的生成编号碰巧与该行的 ID 相同时,该行将在结果集中返回.很可能偶然生成的数字永远不会与ID相同,并且生成的数字与ID多次重合.

When by chance the generated number for particular row happen to be the same as ID of that row, this row is returned in the result set. It is quite possible that by chance generated number would never be the same as ID, as well as that generated number coincided with ID several times.

更详细一点:

  • 服务器选择具有 ID=1 的行.
  • 生成一个随机数,比如 25.为什么不?一个不错的随机数.
  • 1 = 25 吗?否 => 不返回该行.
  • 服务器选择具有 ID=2 的一行.
  • 生成一个随机数,比如 125.为什么不?一个不错的随机数.
  • 2 = 125 吗?否 => 不返回该行.
  • 等等...
  • Server picks a row with ID=1.
  • Generates a random number, say 25. Why not? A decent random number.
  • Is 1 = 25 ? No => This row is not returned.
  • Server picks a row with ID=2.
  • Generates a random number, say 125. Why not? A decent random number.
  • Is 2 = 125 ? No => This row is not returned.
  • And so on...

这里是 SQL Fiddle 的完整解决方案

示例数据

DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);

INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
    'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;

INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
    'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;

Users 有 1000 行,每行具有相同的 UserName.表 Pseudonyms 有 200 行,具有不同的 PseudonymNames:

Table Users has 1000 rows with the same UserName for each row. Table Pseudonyms has 200 rows with different PseudonymNames:

SELECT * FROM @VarUsers;
ID   UserName
--   --------
1    UserName
2    UserName
...
999  UserName
1000 UserName

SELECT * FROM @VarPseudonyms;
ID   PseudonymName
--   -------------
1    PseudonymName1
2    PseudonymName2
...
199  PseudonymName199
200  PseudonymName200

第一次尝试

起初我尝试了一种直接的方法.对于 Users 中的每一行,我想从 Pseudonyms 中随机获取一行:

At first I tried a direct approach. For each row in Users I want to get one random row from Pseudonyms:

SELECT
    U.ID
    ,U.UserName
    ,CA.PseudonymName
FROM
    @VarUsers AS U
    CROSS APPLY
    (
        SELECT TOP(1)
            P.PseudonymName
        FROM @VarPseudonyms AS P
        ORDER BY CRYPT_GEN_RANDOM(4)
    ) AS CA
;

事实证明优化器太聪明了,这产生了一些随机的,但每个User的相同PseudonymName,这不是我所期望的:

It turns out that optimizer is too smart and this produced some random, but the same PseudonymName for each User, which is not what I expected:

ID   UserName   PseudonymName
1    UserName   PseudonymName181
2    UserName   PseudonymName181
...
999  UserName   PseudonymName181
1000 UserName   PseudonymName181

因此,我稍微调整了这种方法并首先为 Users 中的每一行生成一个随机数.然后,我使用生成的数字使用 CROSS APPLYUsers 中的每一行查找带有此 IDPseudonym.

So, I tweaked this approach a bit and generated a random number for each row in Users first. Then I used the generated number to find the Pseudonym with this ID for each row in Users using CROSS APPLY.

CTE_Users 有一个额外的列,其中包含从 1 到 200 的随机数.在 CTE_Joined 中,我们为每个 Pseudonyms 中选择一行用户.最后我们UPDATE原始Users表.

CTE_Users has an extra column with random number from 1 to 200. In CTE_Joined we pick a row from Pseudonyms for each User. Finally we UPDATE the original Users table.

最终解决方案

WITH
CTE_Users
AS
(
    SELECT
        U.ID
        ,U.UserName
        ,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd
    FROM @VarUsers AS U
)
,CTE_Joined
AS
(
    SELECT
        CTE_Users.ID
        ,CTE_Users.UserName
        ,CA.PseudonymName
    FROM
        CTE_Users
        CROSS APPLY
        (
            SELECT P.PseudonymName
            FROM @VarPseudonyms AS P
            WHERE P.ID = CAST(CTE_Users.rnd AS int)
        ) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;

结果

SELECT * FROM @VarUsers;
ID   UserName
1    PseudonymName41
2    PseudonymName132
3    PseudonymName177
...
998  PseudonymName60
999  PseudonymName141
1000 PseudonymName157

SQL 小提琴

这篇关于如何使用另一个表中的随机行更新表的每一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆