Is Python Suitable for Large Find & Replace Operations?


Problem Description


Here's the scenario:

You have many hundred gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited txt file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find-and-replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.

Solution

rbt wrote:

Here's the scenario:

You have many hundred gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited txt file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find-and-replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.



I'm just tossing this out. I don't really have much experience with the
algorithm, but when searching for multiple targets, an Aho-Corasick
automaton might be appropriate.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
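
For illustration, a minimal sketch of multi-pattern search with the
third-party pyahocorasick package (a different, newer library than the one
linked above; the package name, API, and toy data are assumptions, not part
of the original post):

import ahocorasick  # third-party: pip install pyahocorasick

# Build the automaton once from the SSN -> new-ID map (toy entries shown).
ssn_map = {"123-45-6789": "ID-0001", "987-65-4321": "ID-0002"}
automaton = ahocorasick.Automaton()
for ssn, new_id in ssn_map.items():
    automaton.add_word(ssn, (ssn, new_id))
automaton.make_automaton()

# One pass over the text finds every mapped SSN, regardless of context.
text = "clients 123-45-6789 and 987-65-4321 share an address"
for end_index, (ssn, new_id) in automaton.iter(text):
    start = end_index - len(ssn) + 1
    print(start, ssn, "->", new_id)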

Good luck!

--
Robert Kern
rk***@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter


rbt <rb*@athop1.ath.vt.edu> wrote:

You have many hundred gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible.

The first thing I would do is a quick sanity check. Write a simple
Python script which copies its input to its output with minimal
processing. Something along the lines of:

import sys

# Pass stdin through with minimal per-line work, to measure raw I/O cost.
for line in sys.stdin:
    words = line.split()
    print(words)

Now, run this on a gig of your data on your system and time it.
Multiply the timing by 1000 (assuming a terabyte of real data), and
now you've got a rough lower bound on how long the real job will take.
If you find this will take weeks of processing time, you might want to
re-think the plan. Better to discover that now than a few weeks into
the project :-)
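
For example, a rough harness for that measurement (the terabyte total is the
poster's assumption; feed the script a multi-gigabyte sample of the real data
on stdin):

import sys
import time

start = time.time()
n_bytes = 0
for line in sys.stdin:
    n_bytes += len(line)
    line.split()  # stand-in for the minimal per-line processing
elapsed = time.time() - start

TERABYTE = 1024 ** 4
hours_per_tb = elapsed / n_bytes * TERABYTE / 3600
print(f"{n_bytes} bytes in {elapsed:.1f} s -> "
      f"roughly {hours_per_tb:.1f} hours per terabyte", file=sys.stderr)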
You have a tab-delimited txt file that maps the SSNs to the new ID
numbers. There are 500,000 of these number pairs. What is the most
efficient way to approach this? I have done small-scale find-and-replace
programs before, but the scale of this is larger than what I'm accustomed to.



The obvious first thing is to read the SSN/ID map file and build a
dictionary where the keys are the SSNs and the values are the IDs.
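
A minimal sketch of that step, assuming one tab-separated SSN/ID pair per
line as described in the question (the filename is hypothetical):

ssn_map = {}
with open("ssn_to_id.txt") as f:  # hypothetical filename
    for line in f:
        ssn, new_id = line.rstrip("\n").split("\t")
        ssn_map[ssn] = new_id

print(len(ssn_map), "pairs loaded")  # should report 500,000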

The next step depends on your data. Is it in text files? An SQL
database? Some other format? One way or another you'll have to read
it in, do something along the lines of "newSSN = ssnMap[oldSSN]", and
write it back out.
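
For the plain-text case, the rewrite loop might look like the following
sketch, which assumes each SSN stands alone as a whitespace-delimited token
(the file names and toy map are hypothetical):

import re

def replace_ssns(src_path, dst_path, ssn_map):
    # Copy src to dst, swapping any whitespace-delimited token found in ssn_map.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Split but keep the whitespace, so the output layout matches the input.
            parts = re.split(r"(\s+)", line)
            dst.write("".join(ssn_map.get(p, p) for p in parts))

replace_ssns("old_data.txt", "new_data.txt", {"123-45-6789": "ID-0001"})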


rbt <rb*@athop1.ath.vt.edu> writes:

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited txt file that maps the SSNs to the new ID numbers. There
are 500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find-and-replace programs
before, but the scale of this is larger than what I'm accustomed to.



Just use an ordinary Python dict for the map, on a system with enough
RAM (maybe 100 bytes or so for each pair, so the map would be 50 MB).
Then it's simply a matter of scanning through the input files to find
SSNs and look them up in the dict.
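
A sketch of that scan, assuming the numbers appear in the usual NNN-NN-NNNN
form (the pattern, file names, and toy map entry are assumptions about the
data, not from the original post):

import re

# ssn_map as loaded from the tab-delimited map file (toy entry shown)
ssn_map = {"123-45-6789": "ID-0001"}

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def swap(match):
    ssn = match.group(0)
    return ssn_map.get(ssn, ssn)  # unmapped numbers pass through unchanged

with open("old_data.txt") as src, open("new_data.txt", "w") as dst:
    for line in src:
        dst.write(SSN_RE.sub(swap, line))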

