What is the fastest performance tuple for large data sets in python?

Problem description

Right now, I'm basically running through an Excel sheet.

I have about 20 names and about 50k total values, each matching one of those 20 names, so the Excel sheet is 50k rows long: column B holds an arbitrary value and column A holds one of the 20 names.

I'm trying to get a string for each name that shows all of its values.

Name A: 123,244,123,523,123,5523,12505,142... etc etc. 
Name B: 123,244,123,523,123,5523,12505,142... etc etc. 

Right now, I have a loop that runs through the Excel sheet and checks whether the name is already in a dictionary; if it is, then it does a

strA = strA + "," + foundValue

Then it inserts strA back into the dictionary for that particular name. If the name doesn't exist, it creates that dictionary key and then adds that value to it.
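
For reference, a minimal sketch of the concatenation approach described above (the row iteration and the names rows, name, found_value, and values_by_name are assumptions, since the original code isn't shown):

values_by_name = {}
for name, found_value in rows:  # rows: (name, value) pairs read from the sheet
    if name in values_by_name:
        # each concatenation copies the entire existing string
        values_by_name[name] = values_by_name[name] + "," + str(found_value)
    else:
        values_by_name[name] = str(found_value)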

Now, this was working well at first, but it's been about 15 or 20 minutes and only about 5k values have been added to the dictionary so far, and it seems to get slower the longer it runs.

I wonder if there is a better or faster way to do this. I was thinking of building a new dictionary every 1k values and then combining them all at the end, but that would be 50 dictionaries in total and it sounds complicated... although maybe not. I'm not sure; maybe it would work better that way, since this approach doesn't seem to.

I DO need the string that shows each value with a comma between each value. That is why I am doing the string thing right now.

Recommended answer

There are a number of things that are likely causing your program to run slowly.

String concatenation in python can be extremely inefficient when used with large strings.


Strings in Python are immutable. This fact frequently sneaks up and bites novice Python programmers on the rump. Immutability confers some advantages and disadvantages. In the plus column, strings can be used as keys in dictionaries and individual copies can be shared among multiple variable bindings. (Python automatically shares one- and two-character strings.) In the minus column, you can't say something like, "change all the 'a's to 'b's" in any given string. Instead, you have to create a new string with the desired properties. This continual copying can lead to significant inefficiencies in Python programs.
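
As a small illustration of that point (a minimal sketch, not part of the original answer): any "modification" of a string actually produces a new string object.

s = "banana"
t = s.replace("a", "b")   # replace() returns a brand-new string
print(s)                  # banana  -- the original string is unchanged
print(t)                  # bbnbnb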

Considering each string in your example could contain thousands of characters, each time you do a concatenation, python has to copy that giant string into memory to create a new object.

This will be much more efficient:

strings = []
strings.append('string')        # appending to a list does not copy the existing items
strings.append('other_string')
...
result = ','.join(strings)      # build the final comma-separated string once, at the end

In your case, instead of each dictionary key storing a massive string, it should store a list, and you would just append each match to the list, and only at the very end would you do a string concatenation using str.join.
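
Putting that together, here is a minimal sketch of the suggested approach (the iterable rows and the variable names are assumptions; adapt them to however you read the Excel sheet):

from collections import defaultdict

values_by_name = defaultdict(list)
for name, value in rows:                    # rows: (name, value) pairs from the sheet
    values_by_name[name].append(str(value))  # appending to a list is cheap

# join each list into one comma-separated string only once, at the very end
strings_by_name = {name: ','.join(vals) for name, vals in values_by_name.items()}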

In addition, printing to stdout is also notoriously slow. If you're printing to stdout on each iteration of your massive 50,000 item loop, each iteration is being held up by the unbuffered write to stdout. Consider only printing every nth iteration, or perhaps writing to a file instead (file writes are normally buffered) and then tailing the file from another terminal.
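
For example, a sketch of throttling the progress output (the interval of 1000 and the loop variables are assumptions):

for i, (name, value) in enumerate(rows):
    ...                              # do the real work here
    if i % 1000 == 0:                # print progress only every 1000th row
        print(f"processed {i} rows")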
