Anonymization of Account Numbers in 2TB of CSVs


Question


I have ~2TB of CSVs in which the first two columns each contain an ID number. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.

The Question:

Standard hashing algorithms make really long strings, but I will have to do a bunch of ID-matching (i.e. 'for subset of rows in data containing ID XXX, do...') to process the anonymized data, so this is not ideal. Is there a better way?

For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10million] as replacement/anonymized IDs?

The computational constraint is that data will likely be anonymized on a 32-core ~500GB server machine.

Solution

I will assume that you want to make a single pass, one CSV with ID numbers as input, another CSV with anonymized numbers as output. I will also assume the number of unique IDs is somewhere on the order of 10 million or less.

I think it would be best to use a totally arbitrary one-to-one function from the set of ID numbers (N) to the set of de-identified numbers (D), which is more secure than hashing. If you used some sort of hash function, and an adversary learned which hash it was, the numbers in N could be recovered without too much trouble with a dictionary attack: the adversary hashes every plausible ID and matches the results against the data. Instead I suggest a simple lookup table: ID 1234567 maps to de-identified number 4672592, etc. The correspondence would be stored in another file, and an adversary without that file would not be able to do much.
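To make the dictionary-attack concern concrete, here is a minimal sketch. It assumes, purely for illustration, that IDs are 5-digit numeric strings and that the hash in use is an unsalted SHA-256; `hash_id` and `build_dictionary` are made-up names, not part of any library. Because the space of plausible IDs is small, every candidate can be hashed once and any hashed ID reversed by lookup.

```python
import hashlib


def hash_id(account_id: str) -> str:
    """Unsalted SHA-256 of an ID, as a naive anonymizer might apply it."""
    return hashlib.sha256(account_id.encode("ascii")).hexdigest()


def build_dictionary(candidate_ids):
    """Hash every plausible ID once, giving a hash -> ID reverse table."""
    return {hash_id(c): c for c in candidate_ids}


# Assumption: IDs are 5-digit strings, so there are only 100,000 candidates.
candidates = (f"{n:05d}" for n in range(100_000))
reverse_table = build_dictionary(candidates)

# A hashed ID taken from the "anonymized" data is reversed with one lookup.
leaked_hash = hash_id("04242")
recovered = reverse_table[leaked_hash]
print(recovered)
```

A random lookup table has no such weakness, since there is no function for the adversary to recompute; the only way back to the original IDs is the table file itself.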

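With 10 million or fewer unique IDs, on a machine such as you describe, this is not a big problem. A sketch program in Python (file names are placeholders, and the input is assumed to have no header row):

import csv
import random

mapping = {}  # original ID -> de-identified number
unused_numbers = list(range(10_000_000))
# Shuffle once up front; pop() is then an O(1) random draw without replacement,
# whereas deleting a random element from the list would cost O(n) each time.
random.SystemRandom().shuffle(unused_numbers)

with open("input.csv", newline="") as src, \
     open("output.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for record in csv.reader(src):  # read one record at a time
        for col in (0, 1):  # the two ID columns
            N = record[col]
            if N not in mapping:
                # raises IndexError if there are more than 10 million unique IDs
                mapping[N] = unused_numbers.pop()
            record[col] = mapping[N]  # replace N with D in record
        writer.writerow(record)

# write mapping to lookup table file
with open("lookup_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(mapping.items())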
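
The lookup table file is what keeps the scheme reversible for whoever holds it, and reloading it lets the same IDs map to the same numbers when further CSVs are processed later. A minimal sketch of saving, reloading, and inverting the table follows; the two-column layout (original ID, de-identified number) and the function names are assumptions made for illustration, not something the answer specifies.

```python
import csv
import os
import tempfile


def save_lookup_table(mapping, path):
    """Write one 'original ID, de-identified number' row per entry."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(mapping.items())


def load_lookup_table(path):
    """Read the table back into a dict of original ID -> de-identified number."""
    with open(path, newline="") as f:
        return {original: int(anon) for original, anon in csv.reader(f)}


def invert(mapping):
    """Build the reverse table: de-identified number -> original ID."""
    return {anon: original for original, anon in mapping.items()}


# Tiny example table; a real one would come from the anonymization pass.
mapping = {"1234567": 4672592, "7654321": 15}
path = os.path.join(tempfile.mkdtemp(), "lookup_table.csv")

save_lookup_table(mapping, path)
restored = load_lookup_table(path)
reverse = invert(restored)
print(reverse[4672592])
```

If the table is reloaded to continue anonymizing more files, the numbers already assigned would also have to be removed from the pool of unused numbers before drawing new ones. For irreversible anonymization, the table can simply be deleted once all files are processed.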
