如何使用另一个表中的随机行更新表的每一行 [英] How to update each row of a table with a random row from another table
问题描述
我正在构建我的第一个去标识化脚本,但我的方法遇到了问题.
I'm building my first de-identification script, and running into issues with my approach.
我有一个表 dbo.pseudonyms
,它的 firstname
列填充了 200 行数据.这列 200 行中的每一行都有一个值(没有一个为空).该表还有一个 id
列(整数,主键,非空),其中的数字为 1-200.
I have a table dbo.pseudonyms
whose firstname
column is populated with 200 rows of data. Every row in this column of 200 rows has a value (none are null). This table also has an id
column (int, primary key, not null) with the numbers 1-200.
我想要做的是,在一个语句中,重新填充我的整个 USERS
表,其中 firstname
数据从我的 化名
表格.
What I want to do is, in one statement, re-populate my entire USERS
table with firstname
data randomly selected for each row from my pseudonyms
table.
要生成用于挑选的随机数,我使用了 ABS(Checksum(NewId())) % 200
.每次我执行 SELECT ABS(Checksum(NewId())) % 200
时,我都会得到一个我正在寻找的范围内的数值,没有间歇性的不稳定行为.
To generate the random number for picking I'm using ABS(Checksum(NewId())) % 200
. Every time I do SELECT ABS(Checksum(NewId())) % 200
I get a numeric value in the range I'm looking for just fine, no intermittently erratic behavior.
但是,当我在以下语句中使用此公式时:
HOWEVER, when I use this formula in the following statement:
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(Checksum(NewId())) % 200
我得到的结果非常断断续续.我会说大约 30% 的结果返回一个从表中挑选出来的名字(这是预期的结果),大约 30% 返回多个结果(这令人困惑,有没有重复的 id
列值),并且大约 30% 返回 NULL(即使 firstname
列中有 没有 空行)
I get VERY intermittent results. I'd say about 30% of the results return one name picked out of the table (this is the expected result), about 30% come back with more than one result (which is baffling, there are no duplicate id
column values), and about 30% come back with NULL (even though there are no empty rows in the firstname
column)
我确实为这个特定问题寻找了很长时间,但到目前为止都无济于事.我假设这个问题与使用这个公式作为指针有关,但我不知道如何做到这一点.
I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.
想法?
推荐答案
为什么问题中的查询返回意外结果
您的原始查询选自 Pseudonyms
.服务器扫描表格的每一行,从该行中选取 ID
,生成一个随机数,将生成的数字与 ID
进行比较.
Your original query selects from Pseudonyms
. Server scans through each row of the table, picks the ID
from that row, generates a random number, compares the generated number to the ID
.
当特定行的生成编号碰巧与该行的 ID
相同时,该行将在结果集中返回.很可能偶然生成的数字永远不会与ID
相同,并且生成的数字与ID
多次重合.
When by chance the generated number for particular row happen to be the same as ID
of that row, this row is returned in the result set. It is quite possible that by chance generated number would never be the same as ID
, as well as that generated number coincided with ID
several times.
更详细一点:
- 服务器选择具有
ID=1
的行. - 生成一个随机数,比如
25
.为什么不?一个不错的随机数. - 是
1 = 25
吗?否 => 不返回该行. - 服务器选择具有
ID=2
的一行. - 生成一个随机数,比如
125
.为什么不?一个不错的随机数. - 是
2 = 125
吗?否 => 不返回该行. - 等等...
- Server picks a row with
ID=1
. - Generates a random number, say
25
. Why not? A decent random number. - Is
1 = 25
? No => This row is not returned. - Server picks a row with
ID=2
. - Generates a random number, say
125
. Why not? A decent random number. - Is
2 = 125
? No => This row is not returned. - And so on...
示例数据
DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);
INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
表 Users
有 1000 行,每行具有相同的 UserName
.表 Pseudonyms
有 200 行,具有不同的 PseudonymNames
:
Table Users
has 1000 rows with the same UserName
for each row. Table Pseudonyms
has 200 rows with different PseudonymNames
:
SELECT * FROM @VarUsers;
ID UserName
-- --------
1 UserName
2 UserName
...
999 UserName
1000 UserName
SELECT * FROM @VarPseudonyms;
ID PseudonymName
-- -------------
1 PseudonymName1
2 PseudonymName2
...
199 PseudonymName199
200 PseudonymName200
第一次尝试
起初我尝试了一种直接的方法.对于 Users
中的每一行,我想从 Pseudonyms
中随机获取一行:
At first I tried a direct approach. For each row in Users
I want to get one random row from Pseudonyms
:
SELECT
U.ID
,U.UserName
,CA.PseudonymName
FROM
@VarUsers AS U
CROSS APPLY
(
SELECT TOP(1)
P.PseudonymName
FROM @VarPseudonyms AS P
ORDER BY CRYPT_GEN_RANDOM(4)
) AS CA
;
事实证明优化器太聪明了,这产生了一些随机的,但每个User
的相同PseudonymName
,这不是我所期望的:
It turns out that optimizer is too smart and this produced some random, but the same PseudonymName
for each User
, which is not what I expected:
ID UserName PseudonymName
1 UserName PseudonymName181
2 UserName PseudonymName181
...
999 UserName PseudonymName181
1000 UserName PseudonymName181
因此,我稍微调整了这种方法并首先为 Users
中的每一行生成一个随机数.然后,我使用生成的数字使用 CROSS APPLY
为 Users
中的每一行查找带有此 ID
的 Pseudonym
.
So, I tweaked this approach a bit and generated a random number for each row in Users
first. Then I used the generated number to find the Pseudonym
with this ID
for each row in Users
using CROSS APPLY
.
CTE_Users
有一个额外的列,其中包含从 1 到 200 的随机数.在 CTE_Joined
中,我们为每个 从
.最后我们Pseudonyms
中选择一行用户UPDATE
原始Users
表.
CTE_Users
has an extra column with random number from 1 to 200. In CTE_Joined
we pick a row from Pseudonyms
for each User
.
Finally we UPDATE
the original Users
table.
最终解决方案
WITH
CTE_Users
AS
(
SELECT
U.ID
,U.UserName
,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd
FROM @VarUsers AS U
)
,CTE_Joined
AS
(
SELECT
CTE_Users.ID
,CTE_Users.UserName
,CA.PseudonymName
FROM
CTE_Users
CROSS APPLY
(
SELECT P.PseudonymName
FROM @VarPseudonyms AS P
WHERE P.ID = CAST(CTE_Users.rnd AS int)
) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;
结果
SELECT * FROM @VarUsers;
ID UserName
1 PseudonymName41
2 PseudonymName132
3 PseudonymName177
...
998 PseudonymName60
999 PseudonymName141
1000 PseudonymName157
这篇关于如何使用另一个表中的随机行更新表的每一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!